DOCUMENT RESUME 



ED 213 035                                          CS 206 751

AUTHOR          Spandel, Vicki; Stiggins, Richard J.
TITLE           Direct Measures of Writing Skill: Issues and
                Applications. Revised Edition.
INSTITUTION     Northwest Regional Educational Laboratory, Portland,
                OR. Clearinghouse for Applied Performance Testing.
SPONS AGENCY    National Inst. of Education (DHEW), Washington,
                D.C.
PUB DATE        Aug 81
GRANT           OB-NIE-G-78-0206
NOTE            62p.; For related document see ED 196 038.

EDRS PRICE      MF01/PC03 Plus Postage.
DESCRIPTORS     *Educational Assessment; Educational Planning;
                Elementary Secondary Education; *Evaluation Methods;
                *National Surveys; Testing; Test Results; *Test Use;
                *Writing (Composition); *Writing Evaluation

ABSTRACT 

Intended for educators seeking information on direct
writing assessments, this monograph describes general procedures for
planning and conducting a writing assessment and strategies for 
tailoring that assessment to local needs. The introductory chapter 
offers a brief comparison of direct and indirect writing assessment 
methods, highlighting those features of direct assessment that make 
it the most popular approach. The status of writing assessment in 
American education is then summarized with emphasis on current
patterns and developmental trends. The second chapter presents an 
overview of direct writing assessment procedures, touching on 
considerations in maximizing test quality, strategies for exercise 
development and alternative scoring approaches. The third chapter 
discusses selection of a writing assessment approach to suit a 
specific educational context such as program evaluation or student 
screening. Appended is a profile of statewide writing assessment, 
which provides information on the use of objective tests and/or
writing exercises, the developers of such exercises, kinds of writing 
assessed, kind of scoring method used, those using the results, and 
contact persons. (HOD) 



***********************************************************************
*   Reproductions supplied by EDRS are the best that can be made      *
*                 from the original document.                         *
***********************************************************************






U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION

EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.

Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this
document do not necessarily represent official NIE
position or policy.



Direct Measures 
of Writing Skill: 

Issues and Applications 

REVISED EDITION 

Vicki Spandel
Richard J. Stiggins 

"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY 
NREL 



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC) " 



CAPT 

Clearinghouse for Applied Performance Testing 

NORTHWEST REGIONAL EDUCATIONAL LABORATORY 
300 S.W. Sixth Avenue 
Portland, Oregon 97204 



January 1980 

August 1981 (Revised Edition) 



This work is published by the Clearinghouse for Applied Performance
Testing (CAPT) of the Northwest Regional Educational Laboratory, a
private nonprofit corporation. The work contained herein has been
developed under grant OB-NIE-G-78-0206 with the National Institute of
Education (NIE), Department of Health, Education and Welfare. The
opinions expressed in this publication do not necessarily reflect the
position of the National Institute of Education, and no official
endorsement by the institute should be inferred.



TABLE OF CONTENTS

PREFACE

ACKNOWLEDGEMENTS

CHAPTER I: Introduction to Writing Assessment
    Comparison of Direct and Indirect Writing Assessment
        Assessment Focus
        Practical Testing Considerations
        Characteristics of Test Exercises
        Judging Test Quality
        Comparing Assessment Options
    A Status Report on Writing Assessment Programs
    Still—No "Best" Answer

CHAPTER II: An Overview of Direct Writing Assessment Procedures
    Ensuring High Quality Assessment
        Reliability
        Validity
    Developing Exercises
        Assessment planning
        Exercise development
        Review of specifications and exercises
        Exercise pretesting
        Final exercise revision
    Procedures for Scoring Writing Samples
        Holistic scoring
        Analytical scoring
        Primary trait scoring
        Scoring language usage and mechanics
        T-unit analysis
    A Comparison of Scoring Methods

CHAPTER III: Adapting Writing Assessment to Specific Purposes
    Using Tests to Manage Instruction
    Using Tests to Select Students
    Using Tests to Evaluate Programs
    Selecting Examinees as a Function of Purpose
    Developing Exercises as a Function of Purpose
    Selecting Scoring Procedures as a Function of Purpose
    Ensuring Efficient, Effective, and High Quality Assessment

REFERENCES

APPENDIX: Profiles of Statewide Writing Assessment



PREFACE 



The Clearinghouse for Applied Performance Testing 
(CAPT) has published a series of monographs on the assess- 
ment of writing proficiency. Direct Measures of Writing Skill: 
Issues and Applications, published in the first edition in Janu- 
ary 1980, was the initial volume in that series. It presented 
perspectives on writing assessment as they had developed 
through the 1970s and included results of a 1979 national 
survey of statewide writing assessment programs compiled 
by Vicki Frederick of the Wisconsin State Department of 
Education. 

This revised edition contains much of the same information
found in the original. However, the views presented have
been updated to reflect two additional years of writing
assessment research and development. Included in this edition
are summarized results of a 1981 national survey of statewide
and large-city school district writing assessment programs
compiled by Michael McCready and Virginia Melton
of Louisiana Technological University.

This monograph is written for educators interested in 
learning about procedures for the direct measurement of 
writing skills: that is, testing through the use of student writ- 
ing samples. Minimum attention is given in this volume to the 
indirect assessment of writing skills via objective language 
usage tests. Material presented herein is directly useable by 
educators at all levels, from elementary, junior high and high 
school to postsecondary and state department levels.

Those interested in additional information on writing
assessment are directed to three recent CAPT publications:
Using Writing Assessment in the Classroom: A Teacher's
Handbook, A Directory of Writing Assessment Consultants, and
A Guide to Published Tests of Writing Proficiency. The first
provides teachers with strategies for using writing
assessment methods to teach writing skills. The other two
provide consumer information on available published tests of
writing skill, and sources of technical assistance in
developing and implementing writing assessment programs.
Anyone interested in obtaining these publications is urged to
contact CAPT for further details.






CAPT intends to continue its role in collecting, synthesizing
and disseminating information on writing assessment.
Readers are encouraged to submit comments and suggestions
regarding this and other CAPT writing assessment
publications.



Richard J. Stiggins 
CAPT Coordinator 






ACKNOWLEDGMENTS 



The authors wish to thank CAPT staff members Nancy 
Bridgeford and Carol DeWitte for their valuable assistance 
in the production of this revised edition of Direct Measures. 
Nancy played a key editorial and production coordination 
role, and Carol provided excellent secretarial assistance. 

Thanks are also due the NWREL Media Center for their 
production assistance and to the many people whose time 
and ideas contributed to development of the first edition. 

Special thanks to Mike McCready and Virginia Melton of 
the Teacher Education Department, Louisiana Technological 
University, Ruston, Louisiana, for their willingness to share 
the results of their valuable research on state and local writ- 
ing assessment programs. 



Vicki Spandel 
Richard J. Stiggins 






CHAPTER I: Introduction to 
Writing Assessment 



Until recently, those concerned with the large-scale assessment
of writing proficiency relied predominantly on objective
tests of language usage skill. Evidence of this fact can be
found in the language skills tests included in standardized
achievement batteries offered by publishers over the past 40
years, as well as in the language skills sections of the major
national college entrance examinations. However, changing
teacher attitudes and research and development efforts led
by Educational Testing Service (ETS) and the National
Assessment of Educational Progress (NAEP) have combined to
shift the focus of writing assessment away from objective
tests, toward the use of writing samples as the basis for
judging proficiency. This new emphasis has been made possible
in part through development of writing sample scoring
procedures capable of producing valid and reliable results in an
efficient, often cost effective manner.

The direct writing assessment techniques pioneered by 
ETS and NAEP are already being adopted by school dis- 
tricts, state education agencies, postsecondary institutions, 
and test publishers. Further, acknowledging the desirability 
of directly assessing writing proficiency, professional asso- 
ciations of English teachers are urging adoption of writing 
sample-based testing. 

As a result of these developments, many educators are
seeking information on direct writing assessment. This
monograph has been prepared to help meet their needs. It
offers the interested educator the basic information required
to use educationally sound assessments of writing
proficiency. It does not, however, present step-by-step
instructions on how to measure writing skill. Those steps vary
greatly from situation to situation, and should, whenever
possible, be planned with the assistance of an experienced
writing assessment consultant. The monograph does, however,
describe general procedures for planning and conducting
an assessment, and strategies for tailoring that
assessment to local needs. Sources of additional information are
also provided.

This introductory chapter offers a brief comparison of
direct and indirect writing assessment methods, highlighting
those features of direct assessment that make it the most
popular approach. The status of writing assessment in
American education is then summarized with emphasis on
current patterns and developmental trends.

Chapter 2 presents an overview of direct writing assess- 
ment procedures, touching on considerations in maximizing 
test quality, strategies for exercise development and alterna- 
tive scoring approaches. Chapter 3 discusses selection of a 
writing assessment approach to suit a specific educational 
context. 



A Comparison of Direct and Indirect Writing 
Assessment 

There are two viable approaches to the assessment of writing
proficiency. One is the direct method. It relies on actual
samples of student writing to judge writing proficiency. The
second is the indirect method, which relies on objective tests.
Research on the correlation between the two reveals a
consistent and relatively strong relationship at various educational
levels. Summarized here are six studies that correlated
objective language usage test scores with scores obtained on
writing sample-based assessments.

The results of these studies suggest that the two approaches
assess at least some of the same performance factors,
yet each deals with some unique aspects of writing skill.
These similarities and differences relate to assessment focus,
practical aspects of testing, characteristics of test exercises
and aspects of test quality.*

*For a more detailed discussion of these factors as they relate to direct and indirect
assessment, see Stiggins (1981).




Researchers                             Students Tested    N            Correlation

Godshalk, Swineford & Coffman (1966)    [illegible]        [illegible]  [illegible]
Breland, Colon & Rogosa (1976)          [illegible]        [illegible]  [illegible]
Breland & Gaynor (1979)                 College            819          .63
                                                           895          .63
                                                           517          .58
Huntley, Schmeiser & Stiggins (1979)    College            [illegible]  [illegible]
Hogan & Mishler (1980)                  Third graders      140          .68
                                        Eighth graders     160          .65
Moss, Cole & Khampalikit (1981)         Fourth graders     84           .20-.68
                                        Seventh graders    45           .60-.67
                                        Tenth graders      98           .72-.76
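To make the correlations reported in these studies concrete, the sketch below computes a product-moment (Pearson) correlation between one set of objective test scores and one set of writing sample scores. It is a minimal illustration only: the monograph does not name the statistic used in each study, and the function name `pearson_r` and both score lists are hypothetical values invented for the example, not data from any of the studies cited.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the product-moment correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length (at least 2 scores)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one set of scores is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for eight students (illustrative values only).
objective_scores = [31, 42, 25, 38, 47, 29, 35, 40]   # indirect measure
essay_scores = [4, 6, 3, 4, 7, 5, 4, 6]                # direct measure, rated 1-8

r = pearson_r(objective_scores, essay_scores)
print(f"Correlation between indirect and direct scores: {r:.2f}")
```

A coefficient near 1.0 would mean the two methods rank students almost identically; the values in the table indicate substantial but incomplete overlap, which is the point the surrounding text draws from the studies.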



Assessment Focus. Direct and indirect writing assessments
focus on different components of writing. Direct
assessment measures actual composition skill. Indirect
assessment tests ability to use, or recognize proper use of, the
conventions of effective writing: grammar, punctuation, sentence
construction, organization, and so on. Direct assessment
provides necessary and sufficient information for drawing
conclusions regarding a student's writing proficiency. Indirect
assessment, on the other hand, provides necessary, but not
always sufficient, information for evaluating a student's
writing proficiency.

An examination of traits measured in the two approaches 
reveals that indirect assessment tends to cover highly explicit 
constructs in which there are definite right and wrong re- 
sponses (e.g., grammar is either correct or it is not). Direct 
assessment, on the other hand, tends to measure less tangible
skills (e.g., persuasiveness), for which the concept of right
and wrong is less relevant. 

Practical Testing Considerations. Several important prac- 
tical matters related to testing suggest additional differences 
between direct and indirect assessment. 

For example, effective assessment requires appropriate
attitudes on the part of test users. With direct assessment,
users of the test results must be willing to invest the time,
money and effort to conduct a writing assessment that calls
for complex, often time consuming testing procedures. In the
case of indirect assessment, users must be willing to accept a
proxy measure: that is, a test that covers component skills of
writing without actually requiring students to write. Given
the appropriate attitudes, either direct or indirect
assessment will most probably have its desired impact. If those
attitudes are lacking, problems can be anticipated.

In either direct or indirect assessment, the examiner has 
two choices for test acquisition: (a) selecting an already exist- 
ing test or (b) constructing a new test. 

If one decides to use previously developed exercises and
scoring criteria for direct assessment, then the following
skills will be required of those conducting the assessment: (1)
technical expertise in writing, to specify which writing skills
will be assessed; (2) test evaluation skills to investigate
available options and select test items that measure the skills to be
assessed; and (3) organizational skills to set up, administer,
score and report the results of the assessment.

Selecting an already developed objective test requires the
expertise to determine the information needs of the test user 
and to review and select a valid and reliable test. In most 
cases, the user will also have to be skilled in interpreting and 
using norm-referenced standard scores. 

Developing a new direct instrument, which involves creating
a new set of exercises and criteria for scoring, also
demands organizational skills and technical writing expertise.
In addition, however, psychometric expertise is required in
order to evaluate the validity and reliability of the assessment
procedures, and refine exercises and criteria as necessary.

Developing a new indirect assessment or objective test
instrument requires (1) technical expertise in writing to plan
the assessment; (2) skill in item writing or selection; (3)
organizational skills to pilot test, analyze and select the new
items; and (4) psychometric expertise to evaluate the test's
reliability and validity.

In short, developing new instruments for either testing ap- 
proach requires substantially more expertise and staff time 
than does using existing assessment instruments. 

Characteristics of Test Exercises. There are some fundamental
differences in the kinds of test exercises used in direct
and indirect writing assessment. First, the exercises differ in
form. Direct assessment exercises generally take the form of
a short paragraph that invites the examinee to respond to a
question, state an opinion, resolve an issue, explain a
process, recount an event, or simply express his/her feelings. The
exercise, if well constructed, identifies for the examinee the
(1) form of writing to be produced, (2) audience to be
addressed, and (3) purpose for the writing. Indirect assessment
items frequently follow a multiple choice format, though
fill-in questions are sometimes used. Various interlinear forms,
as well as sentence combining items, are common.

As a result of differences in format, direct assessment ex- 
ercises are considerably more flexible than indirect. With di- 
rect assessment, the stimulus can be auditory or visual and 
can be quite true to life (e.g., writing a job application letter). 
Indirect test items, on the other hand, are generally con- 
strained by the multiple choice (or other) format. Therefore, 
while direct assessment exercises can be made to closely ap- 
proximate "real world" writing, objective test items are 
somewhat more artificial. 

Judging Test Quality. The factors commonly considered in 
judging the psychometric adequacy of a test are reliability 
and validity. 

Reliability and validity considerations for direct and indi- 
rect measures are quite similar. In the case of direct meas- 
ures, score stability is important over time, across exercises, 
across test forms and across raters. Consistency across ra- 
ters is not an issue with indirect measures, however, since 
scoring is totally objective. In both cases, sources of inaccu- 
rate scores include poor test items and improper test ad- 
ministration. Sources of score inaccuracy unique to each ap- 
proach include: (1) guessing on indirect measures, and (2) 
poor or inconsistent scoring of direct measures. 

Validity considerations are similar. Content validity is
relevant to both types of writing tests and should be verified in
both via expert judgment. Criterion related validity, also
important in both cases, can be verified through correlations
with other indicators of writing proficiency.

Comparing Assessment Options. Direct and indirect ap- 
proaches to writing assessment are perhaps best compared 
in terms of their relative advantages and disadvantages, and 
the primary ways in which each can be used. 

The major advantages of the direct assessment option are
(1) the extent of information provided about examinees'
writing proficiency, (2) potentially high fidelity (authenticity)
of the exercise and response, (3) the adaptability of exercises
to a variety of relevant real world writing circumstances, (4)
high face validity, and (5) relatively low test development
costs.

The major advantages associated with the indirect
assessment are (1) high score reliability, (2) relatively low test
scoring costs, and (3) high degree of control over the nature of the
skills tested.

The disadvantages of the direct method include (1) high
scoring costs, and (2) the potential lack of uniformity among
examinees regarding the proficiencies assessed.

The disadvantages of the indirect method are (1) lack of 
fidelity to real world writing tasks, (2) heavy reliance on ex- 
aminees' reading rather than writing proficiency, and in 
many cases (3) lack of face validity in the objective measure.

Writing assessment program developers would do well to 
keep these differences clearly in mind when planning an as- 
sessment program. 

A Status Report on Writing Assessment Programs 

In recent years, two national surveys of large-scale writing
assessment programs have been conducted. The first was 
conducted by Frederick (1979) under the auspices of the Wis- 
consin Pupil Assessment Program of the Wisconsin Depart- 
ment of Public Instruction, and the second was conducted in 
1981 by McCready and Melton of Louisiana Technological 
University with support from the National Institute of Educa- 
tion. Each survey, at the time it was conducted, provided very 
useful insights into the status of large-scale assessment, and 
taken together, the two surveys provide valuable perspec- 
tives regarding trends in writing assessment. 

In 1979, 18 states were conducting writing assessment
programs. These assessments spanned the full range of grade
levels. The typical assessment at that time covered about
three grade levels, relied solely on a writing sample, or on a
writing sample in combination with an objective test to judge
proficiency, and involved holistic and/or primary trait
scoring of the writing sample.

The status of writing assessment in 1981 is summarized in
Table 1. Note that 24 states are currently conducting writing
assessments relying predominantly on writing samples
scored holistically. Brief profiles are presented for statewide
and large-city school district writing assessment programs.
More detailed profiles are presented in the Appendix. The
profiles are summarized in various ways in Table 2, which
compares 1979 and 1981 assessments.

Table 1

Evolution of Writing Assessments

                                1979           1981           1981
                                State          State          City
                                Assessments    Assessments    Assessments

Conducting Assessments          18             24             20

Testing in Grades
    K                           1                             1
    1                           1              1              7
    2                           1              1              8
    3                           1              7              10
    4                           7              7              9
    5                           3              5              10
    6                           1              7              11
    7                           2              5              9
    8                           10             10             12
    9                           4              11             15
    10                          2              6              11
    11                          13             11             13
    12                          5              5              9

Mean Grades Tested Per State    2.7            3.2            6.3

Testing Strategy
    Objective only              1    6%        1    4%        3    15%
    Writing Sample only         7    39%       12   50%       9    45%
    Combination                 10   55%       11   46%       8    40%

Scoring Method
    Holistic                    6    33%       15   65%       8    50%
    Analytical                  1    6%        1    5%        5    31%
    Primary Trait               6    33%       4    17%       0    0%
    Combination                 5    28%       3    13%       3    19%

Several dimensions of this comparison are of interest.
First, note that six more states have added assessment
programs over the past two years. Note also that while statewide
assessments in both 1979 and 1981 tended to begin testing in
grades three or four, city schools tend to conduct a good deal
of writing assessment as far down as grades one and two.
Most assessment, however, is conducted in junior and senior
high school. The average number of grade levels tested is on
the increase in statewide assessments. But neither the 1979
nor the 1981 averages on this variable compare to the city
schools' average of 6.3 grade levels in each assessment.

Table 2

Overview of Large-scale Assessment Programs

                    Grade(s)                                             Writing Sample
STATE               Tested              Type of Test                     Scoring Procedure

Alabama             [illegible]         Writing Sample                   Holistic
Delaware            [illegible]         Objective Test, Writing Sample   Primary Trait
California          3, 6, 12            Objective Test, Writing Sample   Holistic
Florida             [illegible], 5, 8, 11   Objective Test, Writing Sample   Analytical
Hawaii              4, 8, 11            Writing Sample                   Holistic
Idaho               9                   Writing Sample                   Holistic
Louisiana           [illegible]         Objective Test, Writing Sample   Primary Trait
Maine               4, 8, 11            Writing Sample                   Holistic
Maryland            9-12                Objective Test, Writing Sample   Holistic
Massachusetts       7, 8, 9, 12         Writing Sample                   Analytical
Michigan            4, 7, 10            Writing Sample                   Primary Trait
Minnesota           4, 8, 11            Writing Sample                   Primary Trait
Nevada              3, 6, 9-12          Objective Test (3, 6),           Holistic
                                        Writing Sample
New Hampshire       5, 9, 12            Writing Sample                   Holistic
New Jersey          9                   Objective Test, Writing Sample   Holistic
New Mexico          10                  Writing Sample                   Holistic
North Carolina      11                  Writing Sample                   Holistic, Analytical
Ohio                8, 12               Objective Test, Writing Sample   Holistic
Oregon              4, 7, 11            Objective Test, Writing Sample   Holistic
Pennsylvania        5, 8, 11            Objective Test
Rhode Island        4, 6, 8, 10         Objective Test, Writing Sample   Holistic
South Carolina      6, 8, 11            Writing Sample                   Analytical
Texas               3, 5, 9             Objective Test, Writing Sample   Holistic
Wyoming             6, 9                Writing Sample                   Holistic

                    Grade(s)                                             Writing Sample
CITY                Tested              Type of Test                     Scoring Procedure

Little Rock, AR     1-11                Objective Test
Phoenix, AZ         9-12                Objective Test, Writing Sample   Analytical
Monterey, CA        1-12                Writing Sample                   Holistic
Tallahassee, FL     1-8                 Objective Test, Writing Sample   Analytical
Atlanta, GA         1-12                Objective Test
Des Moines, IA      9                   Writing Sample                   Holistic, Analytical
Chicago, IL         9-12                Writing Sample                   Analytical
Boston, MA          2, 5, 8             Writing Sample                   Holistic
Wichita, KS         K-12                Writing Sample                   Holistic
Baltimore, MD       1-9                 Objective Test, Writing Sample   Analytical
Detroit, MI         10-12               Objective Test, Writing Sample   Holistic
Raleigh, NC         1-12                Objective Test, Writing Sample   Teacher Option
Albuquerque, NM     4, 6, 9-12          Objective Test (4, 6, 9),        Holistic
                                        Writing Sample
Santa Fe, NM        7-12                Writing Sample                   Holistic, Analytical
New York, NY        8, 11               Writing Sample                   Holistic
Portland, OR        3-9                 Objective Test
Austin, TX          3, 9                Objective Test, Writing Sample   Holistic
Madison, WI         5, 8, 11            Objective Test, Writing Sample   Holistic, Primary Trait
Seattle, WA         3, 6, 9-11          Writing Sample                   Analytical
Laramie, WY         6, 9                Writing Sample                   Holistic
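The "mean grades tested" figures in Table 1 can be checked from the per-grade counts in the same table: add up the number of programs testing at each grade and divide by the number of programs. The sketch below does this for the two 1981 columns. It rests on two assumptions: that the kindergarten entry for the 1981 state column is blank (treated here as zero), and that the table's means were computed this way, which the monograph does not state. The variable and function names are illustrative.

```python
# Number of programs testing at each grade, K through 12, as read from Table 1.
# The K entry for 1981 state assessments is assumed to be blank (zero).
state_1981_counts = [0, 1, 1, 7, 7, 5, 7, 5, 10, 11, 6, 11, 5]
city_1981_counts = [1, 7, 8, 10, 9, 10, 11, 9, 12, 15, 11, 13, 9]

STATE_PROGRAMS_1981 = 24   # "Conducting Assessments" row, 1981 state column
CITY_PROGRAMS_1981 = 20    # "Conducting Assessments" row, 1981 city column


def mean_grades_tested(counts_by_grade, program_count):
    """Average number of grade levels tested per assessment program."""
    if program_count <= 0:
        raise ValueError("program_count must be positive")
    return sum(counts_by_grade) / program_count


state_mean = mean_grades_tested(state_1981_counts, STATE_PROGRAMS_1981)
city_mean = mean_grades_tested(city_1981_counts, CITY_PROGRAMS_1981)

print(f"1981 state programs: {state_mean:.2f} grade levels on average")
print(f"1981 city programs:  {city_mean:.2f} grade levels on average")
```

Under these assumptions both results fall within rounding distance of the 3.2 and 6.3 reported in the table, which supports the reading that city programs test roughly twice as many grade levels as state programs.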

With regard to writing assessment method, there is a rela- 
tively constant pattern over time and across settings. Large- 
scale assessments tend to rely on writing samples alone or 
writing samples in combination with objective tests. Sole reli- 
ance on objective tests is rare. 

Procedures for rating writing performance have changed
markedly over the past two years. In 1979, assessors tended
to rely about equally on holistic and primary trait scoring.
Little attention was paid to the analytical approach. In 1981,
however, in both state and city programs, there has been a 
significant decline in the use of primary trait scoring, and a 
marked increase in the use of both holistic and analytical 
methods. 

In sum, significantly more writing assessment is being
conducted in 1981 than in 1979, and that assessment relies
heavily on holistically scored samples of student writing as the
basis for judging proficiency. For more detail on assessment
programs consult the Appendix.



Still— No "Best" Answer 

These are but a few of the many instances in which writing 
assessment is being successfully conducted on national, state, 
and local levels. The remainder of this monograph describes 
(1) some of the procedures used in various assessment con- 
texts and (2) key measurement issues in the testing of writing 
skill. 

The assessment of writing skill is a very complex task,
because of the broad range of potentially relevant writing
competencies and the difficulties in setting standards of
acceptable performance. There is not now, nor will there ever be, a
single best way to assess writing skill. Each individual
educational assessment and writing circumstance presents unique
problems to the developer and user of writing tests. Therefore,
great care must be taken in selecting the approach and the
methods to be used in each writing assessment. Methods used
in one context to measure one set of relevant writing skills
should not be generalized to other writing contexts without
very careful consideration of writing circumstances.






CHAPTER II: An Overview of Direct
Writing Assessment Procedures 



The development and implementation of a high quality 
writing assessment program can be complex and expensive. 
This chapter outlines procedures for managing that com- 
plexity and ensuring sound assessment. 

Ensuring High Quality Assessment 

Two key considerations in determining the quality of writ- 
ing assessment are the reliability and validity of the scores 
generated by the assessment. The exercise development and 
scoring procedures outlined in the following two sections of 
this chapter have been developed and refined specifically to 
ensure score reliability and validity. However, before 
describing those procedures, it may be useful to explain relia- 
bility and validity as they relate to direct writing assessment. 

Reliability. To be useful for educational decisions, tests 
must yield scores that are consistent or reliable. When scores 
are unreliable, the assessment results can lead to erroneous 
conclusions or decisions. In writing assessment, score incon- 
sistency can take any of several forms. 

For example, suppose a writing assessment were administered
to the same students twice, the second administration
following a two- to three-week interval. And suppose that
even though no writing instruction took place, the scores
obtained the second time were totally different from those
achieved the first time for nearly every examinee. The
examiner would not know which score (if either) to depend on as
the true reflection of the students' proficiency. Or suppose
two writing exercises were developed to measure exactly the
same skills and yet when both were administered to a
student, the exercises resulted in totally different estimates of
proficiency. Again, the examiner would not know which
score was the better indicator of proficiency. Or, from a third
perspective, suppose two judges read and evaluated a writing
sample from the same student and drew totally different
conclusions regarding the student's proficiency. In this case, as
with the others, the examiner would not know which
judgment to rely on. These three examples show how unreliability
can manifest itself in the assessment of writing skill with
writing samples.

When scores are unstable over time, differ across
ostensibly equivalent writing exercises and/or differ across
independent evaluations of proficiency, there is reason to
question the usefulness of the assessment procedures. However,
when the procedures employed yield scores that are stable
over time, across exercises and across independent
evaluators, those scores can be confidently used for educational
decisions. The test developer is responsible for (1) employing
assessment development procedures that maximize score
reliability, and (2) presenting systematic evidence of score
reliability for review by users.
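One simple form that evidence of consistency across independent evaluators can take is the rate at which two judges assign the same score, or scores within one point of each other, to the same papers. The sketch below illustrates that calculation. It is an assumption-laden example rather than a procedure prescribed by this monograph: the function name `agreement_rates`, the four-point scale, and the ratings themselves are all hypothetical.

```python
def agreement_rates(rater_a, rater_b):
    """Return (exact, adjacent) agreement rates for two raters' scores.

    exact    = share of papers given identical scores by both raters
    adjacent = share of papers whose two scores differ by at most one point
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of papers")
    pairs = list(zip(rater_a, rater_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical holistic scores (1-4) given by two judges to the same eight papers.
rater_a = [3, 2, 4, 1, 3, 2, 4, 3]
rater_b = [3, 3, 4, 1, 2, 2, 4, 1]

exact, adjacent = agreement_rates(rater_a, rater_b)
print(f"Exact agreement:    {exact:.1%}")
print(f"Adjacent agreement: {adjacent:.1%}")
```

Low agreement on a pilot set of papers would signal that scoring criteria or judge training need attention before scores are used for decisions. The same paired-score layout also works for the other two cases described above, scores from two administrations or from two exercises, although those are usually summarized with a correlation rather than an agreement rate.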

Three factors are important in developing reliable tests. 
First, the writing skills to be measured must be clearly and 
concisely defined by writing experts. Only then is it possible 
to (1) demonstrate to users, exercise developers, and others
precisely what skills are to be assessed; (2) judge exercise 
appropriateness; and (3) inform judges about the criteria for 
acceptable performance. 

Second, there must be a clear and unambiguous link be- 
tween the skills to be tested and the exercises developed. This 
interrelationship ensures that exercises give the competent 
writer the stimulus and opportunity to demonstrate whatever
skill(s) the user wants to measure. 

And third, judges must be carefully trained to conduct the 
evaluation according to prespecified criteria and agreed 
upon standards. If these three guidelines are followed,
chances are that scores will be consistent over time, across
exercises, and across raters. If scores are found to be
inconsistent, assessment procedures should be re-examined in
light of these guidelines and revised accordingly. 

Validity. Even if a developer of a direct writing assessment 
is successful in achieving score stability through careful skill
identification, exercise development and evaluator training, 
the writing assessment developmental task is only partly 
completed. Attention must also be given to the validity of the 
assessment scores. The validity of a score depends on (1) the 
test used to generate that score, and (2) the intended purpose
for that score. Intended purpose can be identified in a variety 
of ways, each of which can be considered a dimension of 
validity. Cronbach (1971) has identified a number of such 
dimensions that can be applied to the direct assessment of 
writing proficiency. For example, a test may be designed to 
measure a specific set of writing skills. If review of that test by 
qualified experts reveals that the exercises do indeed cover 
those skills, then the test is said to cover the intended content
validly. It has achieved its content coverage purpose. 

From a different but related perspective, a test that plays a 
significant role in educational decision making (e.g., provides 
a basis for placement or selection) should inspire confidence 
among users. The exercises must appear to assess truly im- 
portant skills. If this face validity is missing, the test will not 
be used — regardless of the actual appropriateness of the ex- 
ercises. It is important that the exercises seem appropriate 
even to the least sophisticated of the intended users. 

There are other ways of revealing whether a test is achiev- 
ing its intended purpose. For example, a test of writing 
proficiency is only one of many potential indicators of writing 
skill. If a test is valid, then scores should be consistent with (or
reflect the same level of proficiency as) other indicators of 
writing skill: for example, performance on job-related or 
real-world writing tasks, amount of formal training in writ- 
ing, grades received in writing courses, and/or scores 
achieved in other objective or writing sample-based tests of 
writing skill. To the extent that the writing assessment devel- 
oper is able to show that performance on a newly developed 
writing assessment is consistent with performance on other 
writing-related tasks, the assessment has achieved its goal of 
reflecting writing proficiency. 

Test purpose largely determines the requirements for
documenting validity. For example, a direct writing assessment
may be very general or it may be narrowly focused to be 
precise and diagnostic. Suppose, for instance, that one 
wished to measure students' letter writing skills. A general 
exercise might present the student with these directions: 

Pretend that you are applying for a job as a salesperson
with Acme, Inc. Write a letter to Acme explaining your
interest and qualifications. 

Because these instructions are very broad, responses can 
only be judged on general merit. Raters will likely consider
such factors as word choice, sentence structure, organization, 
mechanics — in short, the kinds of things one would consider 
in judging any piece of writing. And the result will be a general
profile of overall student writing performance. But suppose
one wished to measure students' performance on explicit
letter writing skills, in order to diagnose individual
students' strengths and weaknesses. This would call for some
modification in the item so that it might read as follows: 

Pretend that you are applying for a job as a salesperson 
with Acme, Inc. Write a business letter addressed to Ms.
Jones, Sales Manager of Acme, 2525 Main, Huntsville,
New York 20201. Explain your interest and
qualifications. Attempt to convince Acme that you're the 
best person for the job. Use proper business letter form. 

These specific directions will allow responses to be judged 
according to explicit criteria: students' ability to be convinc- 
ing and use proper business letter format. Responses to the 
first item could not be scored in this manner because the in- 
tended audience, purpose and expected letter format were 
not specified in the instructions. In summary, if diagnostic 
information is desired, items must be carefully structured to 
elicit the appropriate type of response. Evidence of success in 
achieving the desired level of precision should be included in 
validation research. 

The purpose for testing may also be considered in terms of 
the specific educational decision in question. That is, a test 
may be intended to rank order examinees in terms of 
proficiency for selecting the most able for further training or 
the least able for remediation. Or the assessment may be 
intended to provide information for mastery/nonmastery
decisions with regard to specific writing objectives. Because
these are different purposes, the assessment strategies used
to achieve them will differ. It is up to the developer to deter- 
mine the usefulness and appropriateness of assessment pro- 
cedures for meeting each specific decision-oriented purpose. 

The essential point is that validity is a reflection of success 
in achieving the testing purpose. As with reliability, the test 
developer has two primary responsibilities: to maximize va- 
lidity through careful test development and to report evi- 
dence of validity for users. Strategies for maximizing validity 
are similar to those for maximizing reliability. The writing 
skills to be assessed should be clearly and unambiguously 
defined. Both the skills and exercises developed to reflect 
those skills should carefully be reviewed by subject experts to 
ensure appropriateness. And once the test is administered 
and scored, scores should be related to other relevant writing 
proficiency indicators to be sure the assessment is focused on 
the desired dimensions of writing skill. 

Developing Exercises 

In the discussion that follows, a writing exercise is consid- 
ered to comprise all stimulus materials and instructions used 
to define the writing task. Developing exercises for direct as- 
sessment of writing involves five carefully conducted steps. 
The first two steps are crucial for any writing assessment: (1) 
assessment planning and (2) exercise development. The re- 
maining three steps, while very important, are not always 
implemented, depending on the resources available and the 
seriousness of the decisions to be made. These are (3) test 
specification and exercise review, (4) exercise pretesting and 
(5) final revision. Each of these five developmental steps is 
discussed in detail in the following paragraphs. 

Assessment planning. The ultimate quality of any assess- 
ment is influenced more by the thoroughness and detail of its 
original blueprint than by any other factor. Several very im- 
portant test design questions must be thoroughly considered. 
If each is not individually considered, the chances of creating 
a valid and reliable assessment — especially a writing assess- 
ment — are greatly reduced. 

The first planning question concerns purpose. The sole
reason for conducting any educational assessment is to provide
information to facilitate some educational decision.
Therefore, the primary step in writing assessment planning 
is to state precisely the specific educational decision to be 

17 



ERIC 



24 



ERIC 



influenced by the resulting scores. Potential decisions include 
(1) diagnosing individual student proficiency in specific writing
skill areas; (2) rank ordering examinees with regard to
general writing proficiency for selection or placement; and
(3) assessing specific or general writing proficiency to evaluate
the impact of an instructional program. (Additional decisions
will be presented later.) Specific assessment strategies
vary according to purpose. Therefore, the decision(s) to be
facilitated must be clearly specified at the outset. 

Second, test developers must determine the specific form 
of writing to be produced (e.g., essay, business letter, fiction),
the audience to be addressed, and the purpose to be served in
addressing that audience. Any given student's level of 
proficiency will vary as a function of writing form. 

A third planning step calls for identifying the traits to be 
judged in evaluating writing skill and criteria or standards of
acceptable performance for each trait selected. For example, 
organization, style, tone and sense of audience are typical 
traits; that is, elements of writing skill. In order to judge
performance, however, evaluators need more than a list of
traits. They need guidelines or criteria for determining good, 
poor or mediocre organization, style, and so on. The com- 
plexity of traits and criteria is a function of assessment pur- 
pose. A broad assessment of overall writing skill allows some 
flexibility in the specification of criteria. For a diagnostic as- 
sessment, on the other hand, both traits and scoring criteria 
must be delineated with great precision. 

In summary, the writing assessment blueprint must in- 
clude (1) the educational decision(s) to be facilitated, (2) the 
writing context (purpose, audience and type of writing to be
required), and (3) the specific traits or skills to be judged
along with criteria for evaluating performance. 

Exercise development. Once planning is completed, the 
developmental goal becomes quite apparent: to design exercises
that provide the competent student with the necessary
stimulus and writing conditions to demonstrate his/her level
of competency. In other words, the writing tasks must inform
students of the purpose for the writing, the audience to be
addressed and the type of writing expected (necessary conditions),
while at the same time allowing students the latitude
(e.g., sufficient exercises and time) to demonstrate their
capabilities. It should be apparent that unless careful planning
has preceded this step, appropriate exercise development
will be difficult at best.

Here are some specific guidelines to be observed in constructing
writing exercises: First, the exercise developer
should recognize the impossibility of covering all possible
instances of relevant writing. A realistic objective is to construct
and include in the assessment an appropriate sample of relevant
exercises. Based on student performance on that sample,
one can generalize about expected performance in parallel
contexts. To ensure the appropriateness of these
generalizations, however, samples must be carefully selected.
For example, if one wishes to know whether students can
write expository prose for an academic audience, one exercise
is probably not enough; two or three similar exercises
may be necessary to ensure that the sample is sufficiently
representative. At the same time, ability to construct other
forms for other audiences — e.g., an entertaining piece of
fiction for young children — is irrelevant to the testing purpose
at hand.

To use another example, suppose the purpose of an assess- 
ment is to determine mastery of a single clearly focused writ- 
ing objective: ability to present map directions effectively in 
written form. Enough examples of student performance 
should be gathered to ensure that addition of another exer- 
cise would not significantly alter any conclusions about stu- 
dent performance. In other words, exercises must be clearly 
focused and sufficient in number. 

The reader may recognize that this issue of skill sampling is 
related to both reliability and validity, as described earlier. 
For example, it is important to provide enough samples of 
student writing to allow for stable scores (reliability), and to 
fairly and adequately sample the skill domain the test is in- 
tended to cover (validity). 

Certainly the key question in all writing assessment is: 
How much writing is enough? There is no hard and fast an- 
swer. The number of exercises required and the length of 
those exercises are functions of the range of skills to be evalu- 
ated and the level of precision at which those skills are 
defined. Broader assessments covering many skills generally 
require more samples than precisely focused, narrow assessments.
Recent research on this topic (Steele, 1979; Breland,
1977) offers some guidance. The Steele research involved
a broad assessment of end-of-college writing
proficiency via three 20- to 30-minute writing exercises.
Analysis of score consistency revealed that the use of only
one or two exercises yielded unreliable scores. However, the
use of all three exercises raised score consistency to an
acceptable level. Further, the study revealed that the addition
of more exercises beyond the original three would not
significantly increase reliability. These results were supported
by Breland's research, which revealed that, in a similar
college-level assessment, a single 20-minute exercise was
incapable of yielding consistent scores.
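
The pattern just described, in which reliability rises as exercises are added but with shrinking gains, can be illustrated with the Spearman-Brown formula from classical test theory. This is a sketch under stated assumptions. The studies cited above are not said to have used this formula, the single-exercise reliability of .45 is hypothetical, and the formula assumes the exercises are parallel (equally reliable measures of the same skills).

```python
def spearman_brown(single_reliability, num_exercises):
    """Projected reliability of a score based on num_exercises parallel exercises."""
    if num_exercises < 1:
        raise ValueError("num_exercises must be at least 1")
    r = single_reliability
    k = num_exercises
    return (k * r) / (1 + (k - 1) * r)

# Hypothetical reliability of one 20-minute exercise (an assumption, not a reported value).
single = 0.45

# Projected reliability for one through five exercises.
projected = {k: round(spearman_brown(single, k), 2) for k in range(1, 6)}
```

Under these assumptions each added exercise raises the projected reliability by less than the one before it, which is one way to reason about when "enough" writing has been collected.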

Braddock, Lloyd-Jones and Schoer (1963) offer guidance
from a different perspective as to the amount of writing 
needed to judge proficiency: 

Even if the investigator is primarily interested in nothing
but grammar and mechanics, he should afford time for
the writers to plan their central ideas, organization, and
supporting details, otherwise their sentence structure
and mechanics will be produced under artificial circumstances.
Furthermore, the writers ordinarily should
have time to edit and proofread their work after they
have come to the end of their papers. . . Investigators
should consider permitting primary grade children to
take as much as 20 to 30 minutes, intermediate graders as
much as 35 to 50 minutes, junior high school students 50
to 75 minutes, high school students 70 to 90 minutes, and
college students two hours (to demonstrate proficiency).
[Emphasis added.]

Exercises should frame a clear and concise writing task so
that students fully understand what is required — whether or
not they can fulfill the requirements. Time pressure is
undesirable; it is an artificial imposition that may not replicate the
circumstances in which real life writing occurs. Items should
offer the writer a realistic, sensible challenge so as to maintain
interest. Varied stimulus materials (written, auditory, or
visual) should be used. Most important, examinees must be
given time to think, organize, write, reread and revise.

Some writing assessments have attached great importance
to revision. As Rivas (1977) notes: 

Rewriting skills are often considered to be the essence of
good writing. All of us can express ourselves in some
form, however ambiguous or inappropriate, but a good
writer knows how to revise such preliminary statements
so that they become less ambiguous and more appropriate.



Part of NAEP's 1974 writing assessment called for writing
and rewriting the same copy in an attempt to get at revision 
(Rivas, 1977). Students were asked to write a class report 
about the moon, given certain facts. They were given 15 minutes
to write the first draft, using a pencil. Upon finishing,
they were given 13 minutes to revise the first draft, using a
blue pen so that any changes would stand out clearly. They 
were told to make any changes they wished, including crossing
out words or rewriting if necessary; rewriting was not
required, however. Papers were scored for overall organiza- 
tion (based on the quality of the revision), and were catego- 
rized to indicate the kinds of revisions attempted: cosmetic 
(improved legibility), mechanical, grammatical, transitional, 
informational, holistic (complete rewriting), and so on. 
Though some educators might feel the test was not a true
measure of revision skills (many students, for reasons unknown,
attempted no revision), the NAEP moon test represents
at least a step toward development of a proper revision
test. 

Clearly, attention must be given to editing and revision as 
part of any writing assessment, whether by providing 
sufficient time and opportunity for the examinees to revise on 
their own, or by providing specific instructions to revise, as 
NAEP did. If extensive revision (beyond proofreading for
spelling and other mechanical errors) is desired, it will be
necessary to construct the assessment to allow students time
for proper reflection — just as in a real-life writing situation.
It will not be sufficient merely to give students an additional
five or ten minutes at the end of a writing exercise to "fix
things up." A better approach might be to allow students
time to write one day, time to revise on a subsequent day. This
kind of provision may increase administration time and
costs. However, it will also provide a more relevant (i.e., true
to real life) test of revision skills than one-session assessment.

Review of specifications and exercises. Whenever possi- 
ble, the writing and assessment personnel responsible for 
assessment specifications and writing exercises should 
present their work to an independent group of writing and 
measurement specialists for review and formative evalua- 
tion. This review should cover — 

1. The purpose for the assessment (decision to be made). 



2. The definition of the assessment context (form of writ- 
ing, audience and reason for writing). 

3. The criteria (skills to be assessed) and standards of ac- 
ceptable performance. 

4. Relevance of exercises in terms of skills to be assessed. 

5. Representativeness of exercises in terms of the domain 
of possible exercises. 

6. Sufficiency of the exercises in providing students with 
the opportunity, in terms of time and tasks, to demon- 
strate proficiency. 

7. Clarity and conciseness of prescribed writing tasks. 

8. Level of interest and challenge conveyed in stimulus ma- 
terials and writing instruction. 

9. Adequacy of instructions and opportunity for revision,
if that is a desired part of the assessment.

As the importance of an educational decision and/or as the 
number of students to be included in the writing assessment
increases, the importance of independent review increases
also. Thus, review is less critical with small-scale, local or
classroom assessments than with large-scale assessments on
which selection decisions are often based.

Exercise pretesting. Whenever possible, exercises should 
be administered to a sample of students prior to actual full- 
scale administration so that potential problems can be 
identified and corrected. Pretesting procedures should 
closely approximate actual administration in terms of type 
(though not number) of pretest students, conditions (e.g., fa- 
cilities, time limits, methods for providing directions) and 
scoring procedures. Developers should then independently 
evaluate results, attending to (1) the level of proficiency demonstrated
(and whether that level seems to fluctuate from exercise
to exercise), (2) the nature of the responses produced
(in terms of quality, appropriateness, length and enthusiasm),
(3) the consistency of ratings across independent
evaluations, and (4) the apparent clarity of instructions to
students. Exercises that appear to yield inconsistent or repeatedly
low quality results can be identified and the reasons
for apparent problems discussed. Often, exercises can be adjusted.
As with independent exercise review, the importance
of pretesting increases with the scope and importance of the 
assessment. 

Final exercise revision. The final step in exercise develop- 
ment is to revise exercises on the basis of the review and 
pretest results. As final revisions are made, developers 
should continue to ensure reliability and validity of scores 
through careful use of test specifications, exercise develop- 
ment and preparation for scoring. 

Procedures for Scoring Writing Samples 

Many forms of objective tests can be machine scored. Writing
tests that rely on writing samples, however, require individual
hand scoring by qualified persons trained to apply
agreed upon criteria and performance standards. Several
different methods have been devised for scoring writing
samples depending on the assessment purpose. The most appropriate
method in any given situation depends upon what
information one wishes to gain through scoring, how that
information will be used, and what resources are available.
Some scoring methods are more complicated — and therefore
more costly — than others. The purpose of this section is
to present a comparative overview of the general advantages
and disadvantages inherent in each of five approaches: holistic
scoring, analytical scoring, primary trait scoring, scoring
for mechanics and grammar, and T-unit analysis.

Holistic scoring. In holistic scoring, raters review a paper
for an overall or "whole" impression. Specific factors such as
grammar, usage, style, tone and vocabulary undoubtedly affect
the rater's response, but none of these considerations
is directly addressed. As with all rating methods, raters must
be carefully trained to conduct the evaluation. The purpose of
training is to minimize (at least temporarily) the effects of
individual biases by helping raters internalize an agreed
upon set of scoring standards. It is generally recommended
that raters be experienced in language arts, familiar with
pertinent terminology and practiced in rating student papers
at the level for which they will be scoring. Consistency — both
among raters and among scores assigned by a single rater —
is very important in holistic scoring. Initial training takes
about half a day, but it is also necessary to build in time for
"refresher" sessions throughout the course of any scoring activity.

Papers are rated on a numerical scale. NAEP has used 
both 4-point and 8-point scales. Four-point scales are most 
common. An even-numbered scale is recommended because 
it eliminates the convenience of a mid-point "dumping 
ground" for borderline papers. 

Prior to actual scoring, the trainer and the most qualified 
or experienced raters review a subset of the papers to be 
scored in order to identify "range finders." These are papers
that are representative of all the papers at a given scoring
level. With a four-point scale, for example, there would be 
range finders for the 4, 3, 2 and 1 levels. Range finder papers 
must be so typical of papers at a given level that virtually all 
readers agree on the assigned score. This is vital because
range finders are used in training, and later used as models 
to assist raters during scoring. Trainers and their assistants
may have to read dozens of papers in order to find the "typi- 
cal" range finder papers with which everyone is satisfied. For 
training purposes, it is advisable to have at least two (prefer- 
ably more) range finders at each level. 

Trainers do not work from any predetermined set of criteria
in identifying range finders. They may, of course, discuss
their findings and observations during the process. But it is 
important to realize that in holistic scoring, there is no pre- 
conceived notion of the "ideal" paper. A paper assigned a 
score of 4 will simply be a relatively high quality paper within 
a given group; it may or may not be an excellent paper in its
own right. As Brown (1977) notes, "It is possible that all of 
the papers at the top of the score are horribly written. They 
may be better than the rest, but still may be unacceptable to
most teachers of composition." If one has in mind some 
specific criterion of performance that students must meet, 
holistic scoring will not be appropriate. Scoring levels are set
from within, irrespective of external standards. 

Despite personal preferences, the holistic approach quick- 
ly produces marked consistency among raters— in virtually 
any group. This may be partly the result of peer pressure. But 
more likely it suggests that language arts people can agree — 
though the bases for their conclusions may differ — on what 
constitutes a relatively good and a relatively poor paper.
Interrater reliability (that is, agreement between any two raters)
can be expected to run from about .60 to .80 (Diederich,
1974). It may be higher in a few cases, depending upon the
background of the raters and the amount of training time
allowed (so that raters can internalize the system).

All papers should be read by at least two raters to minimize 
the chance of error resulting from rater fatigue, prejudice or 
other extraneous factors. ACT has achieved an interrater
reliability of .75 using two raters and three writing samples
(ACT, 1979). Increasing the number of raters beyond two
does not seem to enhance score reliability (Steele, 1979). 

Scores may be added or averaged across raters to deter- 
mine a final score. Disagreements of more than one rating 
point should be resolved by a third reader or through discus- 
sion by the disagreeing raters. Such disagreements can typi- 
cally be expected to occur in fewer than 5 percent of all cases 
if careful assessment planning and rater training are
conducted.
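
The combining rule described above can be expressed in a few lines. This is only a sketch. The function name, the one-point threshold parameter and the sample ratings are hypothetical, and the sketch simply flags a discrepant paper for resolution rather than prescribing how the third reading is folded into the final score.

```python
def combine_ratings(rater_a, rater_b, max_gap=1):
    """Return (summed_score, needs_resolution) for two independent ratings.

    Ratings that differ by more than max_gap points are not combined;
    the paper is flagged for a third reader or for discussion.
    """
    if abs(rater_a - rater_b) > max_gap:
        return None, True
    return rater_a + rater_b, False

# Hypothetical pairs of ratings on a 4-point scale, one pair per paper.
pairs = [(4, 4), (3, 4), (1, 3), (2, 2)]
results = [combine_ratings(a, b) for a, b in pairs]

# Share of papers needing resolution, for comparison with the
# "fewer than 5 percent" expectation stated above.
flagged = sum(1 for _, needs_resolution in results if needs_resolution)
discrepancy_rate = flagged / len(pairs)
```

Tracking the discrepancy rate during a scoring session gives an early signal that a refresher session may be needed.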

Holistic scoring is rapid and efficient. Depending on the 
length of student responses, experienced raters can usually 
go through 30 to 40 papers per hour (though inexperienced 
raters cannot be expected to match this rate). Six hours of 
scoring per day is considered about maximum to maintain 
high reliability. Scoring is intensive work; short hours with 
frequent break periods yield the best results. 
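
These figures permit a rough estimate of scoring time. The sketch below uses the lower rate of 30 papers per hour and the six-hour daily maximum given above. The number of papers (1,000) and the function names are hypothetical, and two readings per paper are assumed.

```python
import math

def rater_hours(num_papers, readings_per_paper, papers_per_hour):
    """Total hours of reading required across all raters."""
    return num_papers * readings_per_paper / papers_per_hour

def rater_days(hours, hours_per_day=6):
    """Whole rater-days needed, rounding any partial day up."""
    return math.ceil(hours / hours_per_day)

# Hypothetical assessment: 1,000 papers, each read by two raters.
hours = rater_hours(num_papers=1000, readings_per_paper=2, papers_per_hour=30)
days = rater_days(hours)
```

Dividing the resulting rater-days by the number of raters available gives an approximate length for the scoring session, before allowing for training and refresher time.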

Because scoring levels are never defined, holistic scoring 
does not permit the reporting of specifics on student per- 
formance. After reading hundreds of papers, however, ra- 
ters typically have a supremely clear notion of what factors 
influenced them to assign particular scores. For reporting 
purposes they may translate those observations into level 
definitions. Suppose, for example, that students were asked 
to write a job application letter. One might then say that a 
"typical" 4 paper used proper business letter format, used 
vocabulary and tone appropriate to the occasion, described 
the student's qualifications in a way that reflected a clear
understanding of job requirements (as presented in the item),
and reflected consistently good sentence structure, correct 
mechanics, and so on. Such a definition would not necessarily 
apply in total to every 4 paper, but would certainly capture 
the essence of papers at that level and help make results 
meaningful to parents and other audiences. Presentation of 
such definitions in conjunction with sample student papers 
can be an extremely effective reporting technique. 



Analytical scoring. Analytical scoring involves isolating 
one or more characteristics of writing and scoring them individually.
Analytical scoring is most appropriate if one wants
to measure (and report) students' ability to deal with one or
more specific conventions of writing: punctuation, organiza- 
tion, syntax, usage, creativity, sense of audience, and so on. 
Traits must be explicit and well defined so that all raters un- 
derstand and agree upon the basis for making judgments. In 
addition, it is necessary to delineate in advance specific and 
complete criteria for judging each trait. In analytical scoring, 
raters rely on written guidelines — not range finders — to as- 
sist them in assigning scores. Ideally, raters should have a 
chance to participate in selecting traits and establishing crite- 
ria. This promotes understanding of and agreement with cri- 
teria, and ultimately enhances interrater reliability. Except 
for the setting of criteria, training and administration proce- 
dures are similar to those for holistic scoring. 

Analytical scoring provides data on specific aspects of stu- 
dent writing performance. But does it really reveal whether, 
in general, students write well? The answer depends on (1)
whether enough traits are analyzed to provide a comprehen- 
sive picture, and (2) whether those traits analyzed are 
significant — that is, whether they actually contribute to good
writing. In an effort to identify those characteristics that 
seem most to influence a reader's judgment about the quality 
of a piece of writing, Diederich (1974) performed a content
analysis on a sample of student essays scored holistically.
Marginal comments were invited (as would not be the case in
a traditional holistic session), and later tallied to isolate those
factors that seemed to influence experienced raters' scores
most. Here, in order of significance, are the factors Diederich
isolated through that study: 

1. Ideas 

2. Mechanics (including usage, punctuation and spelling) 

3. Organization 

4. Wording 

5. Flavor (or style)

Of course, individual examiners may identify other traits 
they wish to score. However, this list of traits permits a reasonably
comprehensive analysis of writing.



Factor-by-factor analysis of writing elements is more time 
consuming than holistic scoring. Depending on how many 
factors one looks at, it requires two to three times as long (or 
more) to rate a paper analytically as it does holistically.

Analytical rating has been criticized because there is some 
indication it produces a "halo" effect; that is, students who
are rated high on one trait will tend to be rated high on all 
traits. Page (1968) explains, 

A constant danger in multi-trait ratings is that they may 
reflect little more than some general halo effect, and that 
the presumed differential traits will really not be meaningful.
We find (in our research) a very large halo, or
tendency for ratings to agree with each other.

Despite these disadvantages, however, analytical scoring 
has one great advantage: it provides potential for trait-by- 
trait analysis of students' writing proficiency. 

Primary trait scoring. Primary trait scoring is similar to 
analytical scoring in that it focuses on a specific characteristic 
(or characteristics) of a given piece of writing. However, 
while analytical scoring attempts to isolate those characteris- 
tics important to any piece of writing in any situation, pri- 
mary trait analysis is rhetorically and situationally specific. 
The most important — or primary — trait(s) in a letter to the 
editor will not likely be the same as that (those) in a set of 
directions for assembling a bicycle.

The primary trait system is based on the premise that all 
writing is done in terms of an audience, and that successful 
writing will have the desired effect upon that audience. For 
example, a good mystery story will excite and entertain the 
reader; a good letter of application will get the interview. In a 
scoring situation, of course, papers must be judged on the 
likelihood of their producing the desired response. 

Because they are situation-specific, primary traits differ 
from item to item, depending on the nature of the assignment. 
Suppose a student were asked to give directions for driving 
from his/her home to school. The primary trait might then be 
sequential organization, for any clear, unambiguous set of 
directions would necessarily be well organized with details 
presented in proper order. As Mullis (1974) points out, 
"Successful papers will have that [primary] trait; unsuccessful 
papers will not, regardless of how well written they may be in 
other respects." 



Raters determine that some traits are essential to success 
in a given assignment. However, additional traits that con- 
tribute but are not necessarily essential to the success of a 
paper are termed "secondary" traits and may also be included 
in the evaluation, if they can be clearly defined and 
exemplified for raters. Scores may be weighted to show the 
relative importance of various traits, if desired, then totalled 
to indicate the overall quality of the paper. 
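
The weighting and totalling just described amounts to a weighted sum. The following is a minimal sketch, assuming a primary trait that counts double and one secondary trait; the trait names, weights and scores are hypothetical and would be set by the raters.

```python
def weighted_total(scores, weights):
    """Multiply each trait score by its weight and add the results."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for trait(s): {sorted(missing)}")
    return sum(scores[trait] * weight for trait, weight in weights.items())


# Hypothetical weights: the primary trait counts twice as much
# as the secondary trait.
weights = {"sequential organization": 2, "clarity of detail": 1}

# Hypothetical scores one rater assigned to one paper (4-point scale).
scores = {"sequential organization": 3, "clarity of detail": 4}

total = weighted_total(scores, weights)
print(total)
```

Raising an error when a weighted trait has no score keeps an incomplete rating from passing as a low total.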

The first step in primary trait scoring is to determine which 
trait or traits will be scored. The second is to develop a 
scoring guide to aid raters in assigning scores. To illustrate, 
consider the following guide developed by NAEP for scoring 
"letters to the principal on solving a problem in school." It 
was determined that a good letter would identify the problem, 
present a solution, and explain how that solution would 
improve the school. Here are NAEP's criterion levels: 

1. Respondents do not identify a problem or give no evidence 
that the problem can be solved or is worth solving. 

2. Respondents identify a problem and either tell how to 
solve it or tell how the school would be improved if it 
were solved. 

3. Respondents identify a problem, explain how to solve 
the problem, and tell how the school would be improved 
if the problem were solved. 

4. Respondents include the elements of a "3" paper. In addition, 
the elements are expanded and presented in a 
systematic structure that reflects the steps necessary to 
solve the problem (Mullis, 1974). 
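
The logic of these four levels can be sketched as a small decision function. This is only an illustration: the function and parameter names are hypothetical, and in practice each yes/no input is itself a trained rater's judgment about the paper.

```python
def principal_letter_level(identifies_problem, explains_solution,
                           tells_improvement, systematic_structure=False):
    """Map a rater's yes/no judgments onto the four criterion levels."""
    if not identifies_problem:
        return 1
    if explains_solution and tells_improvement:
        # All elements of a "3" paper are present; a "4" also expands
        # them in a systematic structure.
        return 4 if systematic_structure else 3
    if explains_solution or tells_improvement:
        return 2
    # A problem is named, but there is no evidence that it can be
    # solved or is worth solving.
    return 1


print(principal_letter_level(True, True, False))
```

Writing the levels out this way can expose gaps in a guide, such as what score to give a paper that names a problem and nothing else.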

Range finder papers may be used in addition to the scoring 
guide. This practice is not common, however, for many raters 
find it cumbersome to rely on two points of reference. 

All raters should be familiar with the rationale underlying 
the primary trait system, and with the level definitions to be 
used in scoring. Raters must accept the fact that they will be 
looking for specific, well-defined traits, and be cautious about 
allowing extraneous criteria to influence their scoring. NAEP 
recommends that raters prescore (for practice) at least 10 sample 
papers at each level during training in order to become 
comfortable with applying the criteria (Mullis, 1974). 



As with analytical scoring, defining criterion levels is the 
most time consuming step. It may be necessary to "test" nu- 
merous definitions on sample papers in order to come up 
with a set that works. Herein lies a strong argument for keeping 
the list of traits to be scored brief. On average, count 
on a day of trial and error, discussion and debate for each 
trait to be defined. This may sound time consuming, but the 
quality and clarity of the final definitions, and the ease with 
which they can be applied, will readily justify the time spent. 

Like analytical scoring, primary trait scoring can allow the 
reporting of student performance with respect to specific 
characteristics: e.g., organization, awareness of audience. 
For this reason, primary trait scoring is greatly favored over 
holistic scoring in contexts where more precise information is 
needed. But this advantage should be carefully weighed 
against the time and effort required to set up a workable 
primary trait scoring system. Aside from adopting already 
written criteria (e.g., from NAEP), there are no known 
shortcuts. 

Scoring language usage and mechanics. Of the types of 
scoring mentioned thus far, the scoring of writing mechanics 
is the most time consuming, and the most complex approach 
for which to provide training. This realization often comes as 
a great surprise to inexperienced raters, who may look on 
mechanics as a rather cut and dried affair, until faced with 
the prospect of setting up a scoring system. 

The fact is, the standards of appropriate usage are subject 
to continual change through popular usage. So rapid has that 
change become now that even usage textbooks sometimes 
reflect different notions of what is appropriate. For the sake 
of consistency in scoring mechanics, it is necessary that a 
fairly comprehensive guide be developed. It is possible, of 
course, to use a standard reference (an English handbook) 
for this purpose. But raters must agree to abide by the 
document, and if there are too many areas of disagreement, 
it may be simpler to design their own. Whatever the decision, 
it is imperative that everyone agree to score according to the 
rules of the guide, regardless of personal preference. 
Otherwise, the inconsistency will render the scores useless. 

Several other decisions must be made as well: 

1. Whether to count errors of commission and errors of 
omission equally. 






2. Whether to require formal usage, or to base guide rules 
on informal usage. 

3. Whether to count errors involving concepts or rules with 
which students may not be familiar (e.g., seventh graders 
may not have been taught proper use of colon and 
semicolon; should this be considered?). 

4. Whether to count every identifiable error or to focus on 
specific areas for easier reporting of results. 

In addition, raters must establish a workable rating scale. If 
they choose to retain a 4-point scale, for example, it will be 
necessary to determine how many errors will be allowed in a 
4 paper, how many in a 3, and so on. 

One additional step necessary in scoring writing me- 
chanics is obtaining an accurate word count for each paper. 
Errors can then be tabulated per 100 words. Analyzing errors 
in this way does not penalize those who write long 
responses, or give unfair advantage to those who write very 
little. 
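
The per-100-words tabulation, and its conversion to a 4-point scale, can be sketched as follows. The cutoffs are purely hypothetical; raters would need to set their own when establishing the rating scale.

```python
def errors_per_100_words(error_count, word_count):
    """Error rate normalized to a 100-word passage."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 100 * error_count / word_count


# Hypothetical cutoffs: (maximum errors per 100 words, score awarded).
HYPOTHETICAL_CUTOFFS = [(2.0, 4), (5.0, 3), (10.0, 2)]


def mechanics_score(error_count, word_count, cutoffs=HYPOTHETICAL_CUTOFFS):
    """Convert an error rate to a score on a 4-point scale."""
    rate = errors_per_100_words(error_count, word_count)
    for max_rate, score in cutoffs:
        if rate <= max_rate:
            return score
    return 1  # more errors than the most lenient cutoff allows


# The same six errors in a long paper and in a short one.
print(mechanics_score(6, 300))
print(mechanics_score(6, 100))
```

Under these assumed cutoffs, six errors in 300 words earn a higher score than six errors in 100 words, which is the point of normalizing by length.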

Test administrators should be cautioned about scoring me- 
chanics as one trait within a primary trait system. As the 
foregoing discussion indicates, it is far more time consuming 
to score than other traits, and demands a number of special 
considerations. Therefore, test administrators should weigh 
carefully the advantages and disadvantages of such a com- 
bined approach. 

Educators considering using the direct assessment ap- 
proach to evaluate mechanics should remember that under- 
standing of such usage elements as punctuation, grammar, 
diction, and sentence structure can be very efficiently, validly 
and reliably assessed using available indirect assessment 
measures. For mechanics or usage assessment, very careful 
consideration should be given to the objective test because it 
forces examinees to demonstrate explicit ability to deal effec- 
tively with the precise elements being tested. If a writing sam- 
ple is used to assess these elements, examinees will typically 
avoid language constructions which they are unable to use 
effectively. Further, inconsistencies in usage patterns will 
make comparisons among examinees, on the basis of me- 
chanics, difficult if not impossible. Such comparisons are 
generally possible with objective usage tests. In addition, be- 
cause a writing sample taps but a small, arbitrary portion of 






an examinee's proficiency in writing mechanics, results cannot 
appropriately be used in diagnosis, whereas objective 
test results may be quite suitable for this purpose. 

T-unit analysis. The concept of T-unit analysis was introduced 
in the 1960s, and has gained popularity ever since as a 
means of measuring writing sophistication. A T-unit may be 
thought of as an independent clause plus whatever subordi- 
nate clauses or phrases accompany it. In simple terms, a T- 
unit is the smallest group of words in a piece of writing that 
could be punctuated as a sentence (T stands for "termina- 
ble"). Consider the following passage: 

I yelled at my cat Manfred and he ran away, but he came 
home when he got hungry. 

This passage has only one terminal mark of punctuation as 
written, but actually contains three T-units: 

• I yelled at my cat Manfred 

• and he ran away, 

• but he came home when he got hungry. 

Each of these T-units is an independent clause that could be 
punctuated as a sentence. Note that T-unit analysis is inde- 
pendent of punctuation; a writer may or may not punctuate 
T-units as sentences. 

Studies have shown that T-unit length tends to increase 
with the age and skill of the writer* (Hunt, 1977). In addition, 
it has been demonstrated that with increased skill, writers 
can incorporate a greater number of distinct concepts into a 
single T-unit. Consider the following example, using six short 
sentences, each of which consists of one T-unit, abstracted 
from a longer piece: 

1. Aluminum is a metal. 

2. It is abundant. 

3. It has many uses. 

4. It comes from bauxite. 

*There are notable exceptions; therefore, this tendency cannot be applied as a 
general rule. Highly experienced, sophisticated writers may consistently use short 
T-units. Conversely, the use of lengthy T-units does not of itself render one a 
skillful writer. 






Table 3 

A Comparison of Scoring Methods for 
Direct Writing Assessment 

GENERAL CAPABILITIES 

  Holistic: Comprehensive, general picture of student performance; 
  writing viewed as a unified coherent whole. Applicable to any 
  writing task. 

  Analytical: Thorough, trait by trait analysis of writing; provides 
  comprehensive picture of performance if enough traits are 
  analyzed; traits are those important to any piece of writing in 
  any situation (e.g., organization, wording, mechanics). 

  Primary trait: Highly focused analysis of situation-specific 
  primary trait (and possibly secondary traits); provides specific 
  information on a narrowly defined writing task (e.g., ability to 
  recount details in chronological order). 

  Writing mechanics: Can provide either a general or a specific 
  profile of the student's ability to use mechanics properly. 

  T-unit analysis: Provides a measure of syntactical sophistication. 

RELIABILITY 

  Holistic: High reliability if standards are carefully established 
  and raters are carefully trained. 

  Analytical: High reliability if criteria and standards are well 
  defined, and careful training is conducted. 

  Primary trait: High reliability if criteria and standards are well 
  defined, and careful training is conducted. 

  Writing mechanics: High reliability if given sufficient training 
  time and authoritative, complete, acceptable guidelines (e.g., an 
  English handbook). 

  T-unit analysis: High reliability provided trained and experienced 
  raters are used. 

PREPARATION TIME 

  Holistic: Up to one day per item to identify range finder (model) 
  papers; up to one-half day to train readers using 4-point scale; 
  full day to train with 8-point scale. 

  Analytical: One full day to identify traits; one day per trait to 
  develop scoring criteria (unless traits and criteria are borrowed 
  from another source); one to two days to review results of pilot 
  test and refine traits or criteria as necessary; one-half day to 
  train raters. 

  Primary trait: One full day to identify traits; one day per trait 
  to develop scoring criteria (unless traits and criteria are 
  borrowed from another source); one to two days to review results 
  of pilot test and refine traits or criteria as necessary; one-half 
  day to train raters. 

  Writing mechanics: One to two days to set up a scoring system 
  (unless borrowed from another source). Minimum of one day to 
  internalize the scoring system and practice scoring. 

  T-unit analysis: Half day to full day, depending on raters' 
  previous experience. 

READERS 

  Holistic: Qualified language arts personnel recommended; high 
  reliability can be achieved with non-language arts readers given 
  sufficient training. 

  Analytical: Qualified language arts personnel recommended. 

  Primary trait: Qualified language arts personnel recommended; 
  non-language arts staff may be able to score some traits. 

  Writing mechanics: Qualified language arts personnel recommended. 

  T-unit analysis: Raters must be experienced language arts 
  personnel, preferably those already familiar with the concept of 
  T-unit analysis. 

SCORING TIME 

  Holistic: One to two minutes per paper (experienced readers may 
  read faster). 

  Analytical: One to two minutes per paper per trait. 

  Primary trait: One to two minutes per paper per trait. 

  Writing mechanics: Five minutes or more per paper, depending on 
  number of criteria. 

  T-unit analysis: Varies greatly, depending on raters' skill. 

CLASSROOM USE 

  Holistic: May be adapted for use in class. 

  Analytical: May be adapted for use in class. 

  Primary trait: May be adapted for use in class. 

  Writing mechanics: May be adapted for use in class. 

  T-unit analysis: May be adapted for use in class. 

REPORTING 

  Holistic: Allows reporting on students' overall writing skill. 

  Analytical: Allows reporting of student performance on wide range 
  of generalizable traits (i.e., the qualities considered important 
  to all good writing). 

  Primary trait: Allows reporting of student performance on one or 
  more situation-specific traits important to a particular task. 

  Writing mechanics: Allows reporting of group or individual data on 
  students' general strengths or weaknesses in mechanics. 

  T-unit analysis: Allows group or individual reporting on 
  syntactical sophistication. 

GROUP SAMPLE SIZE* 

  Holistic: Primarily usable with a larger sample; with a small 
  sample, responses may be difficult to [remainder missing]. 

  Analytical: Best with smaller samples; extensive scoring time may 
  make costs prohibitive with larger groups. 

  Primary trait: Generally more cost-effective with smaller samples, 
  depending on the number of traits to be scored (with one trait, 
  sample size is not an issue). 

  Writing mechanics: Best with smaller samples; extensive scoring 
  time may make costs prohibitive with larger groups. 

  T-unit analysis: Best with smaller samples; extensive scoring time 
  may make costs prohibitive with larger groups. 

* These are very general guidelines. Due to the nature of the 
scoring-cost/amount-of-information trade-off across scoring methods, 
readers are urged to seek the technical assistance of a qualified 
writing assessment specialist if there is a question regarding the 
appropriate use of available scoring resources. 






5. Bauxite is an ore. 



6. Bauxite looks like clay. 

Here's how a fourth grader rewrote the passage: 

Aluminum is a metal and it is abundant. It has many 
uses and it comes from bauxite. Bauxite is an ore and 
looks like clay. (6 sentences into 5 T-units) 

The revision of a typical eighth grader: 

Aluminum is an abundant metal, has many uses, and 
comes from bauxite. Bauxite is an ore that looks like 
clay. (6 sentences into 2 T-units) 

And finally, the revision of a skilled adult, a professional 
writer: 

Aluminum, an abundant metal with many uses, comes 
from bauxite, a claylike ore. (6 sentences into 1 T-unit) 

T-unit analysis and review of conversions (from simple 
sentences into T-units) provide a good measure of sentence 
maturity and of a student's ability to consolidate multiple 
thoughts. 
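
One common summary figure in T-unit analysis is the average number of words per T-unit. The sketch below applies it to the three rewrites of the aluminum passage. It assumes the T-units have already been marked off by hand, which is the step that requires a trained rater; the function name is hypothetical.

```python
def mean_t_unit_length(t_units):
    """Average number of words per T-unit (T-units segmented by hand)."""
    word_counts = [len(unit.split()) for unit in t_units]
    return sum(word_counts) / len(word_counts)


fourth_grader = [
    "Aluminum is a metal",
    "and it is abundant",
    "It has many uses",
    "and it comes from bauxite",
    "Bauxite is an ore and looks like clay",
]
eighth_grader = [
    "Aluminum is an abundant metal, has many uses, and comes from bauxite",
    "Bauxite is an ore that looks like clay",
]
skilled_adult = [
    "Aluminum, an abundant metal with many uses, comes from bauxite, "
    "a claylike ore",
]

for label, units in [("fourth grader", fourth_grader),
                     ("eighth grader", eighth_grader),
                     ("skilled adult", skilled_adult)]:
    print(f"{label}: {mean_t_unit_length(units):.1f} words per T-unit")
```

For these three versions the averages come to 5.0, 10.0 and 13.0 words per T-unit, in line with the tendency for T-unit length to rise with skill.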

Sophisticated, condensed writing has undeniable appeal. 
T-unit analysis used in conjunction with holistic scoring is 
likely to reveal that the highest scored papers (i.e., those that 
appealed most to readers) were in fact those with the most 
sophisticated use of T-units. 

T-unit analysis is still in the experimental stages. It is time 
consuming and costly to conduct. Moreover, it can only be 
done by highly trained language arts specialists. Further 
research and use may, however, reveal more widespread 
applicability than has so far been anticipated. Two interesting 
footnotes: syntactical maturity is apparently reflected in oral 
speech as well as in writing, and such maturity can be 
enhanced through a sentence combining curriculum (Hunt, 
1977). 

A Comparison of Scoring Methods 

Table 3 offers a comparative overview of the scoring 
procedures discussed in this section, focusing on several key 
descriptors. 
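
The scoring-time figures in Table 3 lend themselves to a rough estimate of rater hours. The sketch below is illustrative only: the sample of 600 papers and the five analytical traits are hypothetical, and the per-paper figures are the one-to-two-minute range given in the table.

```python
def scoring_hours(papers, minutes_per_paper, traits=1):
    """Total rater hours: papers x traits x minutes per paper, in hours."""
    return papers * traits * minutes_per_paper / 60


papers = 600          # hypothetical number of papers to be scored
low, high = 1, 2      # minutes per paper (per trait), from Table 3

holistic = (scoring_hours(papers, low), scoring_hours(papers, high))
analytical = (scoring_hours(papers, low, traits=5),
              scoring_hours(papers, high, traits=5))

print(f"holistic:   {holistic[0]:.0f} to {holistic[1]:.0f} rater hours")
print(f"analytical: {analytical[0]:.0f} to {analytical[1]:.0f} rater hours")
```

The estimate covers scoring only; preparation and training time from Table 3 would be added separately.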






Chapter III: Adapting Writing 
Assessment to Specific Purposes 



Educational tests have only one function: to facilitate edu- 
cational decision making. A test should not be administered, 
therefore, until the decision or decisions that rest on the re- 
sults of that test have been clearly articulated. This applies to 
all tests, including writing tests. 

In many educational contexts, writing tests can be and are 
being used effectively. For example, tests can play a role in 
instructional management decisions. Such decisions include 
(1) the diagnosis of individual learner strengths and 
weaknesses for instructional planning, (2) the placement of 
students into the next most appropriate level of instruction, 
and (3) educational and vocational planning as part of stu- 
dent guidance and counseling. 

Tests can also be administered at key points in an educational 
program to check student development in order to (1) 
screen students for admission to an advanced or remedial program, 
or (2) certify minimum proficiency (e.g., for high school 
graduation). 

And finally, tests can be used for program evaluation purposes 
such as in (1) large-scale survey assessment, (2) formative 
program evaluation, and (3) summative program evaluation. 

In the discussion that follows, each of these eight contexts 
is described in terms of the decision to be made, the primary 
decision makers, and the type of writing skill information 
needed to make the decision. Decision makers include students, 
parents, teachers, administrators (including specific 
project or program administrators, as well as building-, 
district- and state-level administrators), guidance counselors, 
and the public (including taxpayers and elected officials). 

Using Tests to Manage Instruction 

Diagnosis. Teachers often use tests and other perform- 
ance indicators to track each student's level of development, 
thereby determining where that student is in the instructional 
sequence, and anticipating the next appropriate level 
of instruction. Diagnostic data gathered via direct writing as- 
sessment can help individualize instruction by simplifying 
student grouping or instructional scheduling decisions. In 
addition, diagnostic writing skill data gathered over time 
may provide a basis for grading or communicating progress 
to parents. 

Placement. Decision makers such as teachers and educa- 
tional administrators must place each student at the level of 
instruction best suited to his/her skills. Typically, they use 
such performance indicators as writing skill tests, previous 
courses completed, and grades to rank order students along 
a continuum of writing skill development, then place them in 
the appropriate course. 

Guidance and Counseling. In deciding their future educa- 
tional or vocational activities, students need to know how 
their writing skill compares to that of other students with 
whom they could compete. Performance indicators like writing 
tests can help provide such information. Writing tests can 
indicate the probability that a given student will find success 
and satisfaction in a program or professional position for 
which writing skill is a prerequisite. More specifically, nor- 
mative test data can help students, their parents and their 
guidance counselors answer students' typical questions: 
Should I pursue advanced training in a postsecondary educa- 
tional program in which writing is a key element? In which 
school or job am I most likely to be successful? Though test 
scores should never serve as the sole basis for answering 
such questions, they can play a valuable role. 

Using Tests to Select Students 

Admission. It is not uncommon to have more candidates 
than program openings. When this happens, teachers, counselors 
and administrators must select students for admission. 
Performance indicators such as writing tests can be 
used to rank order examinees to facilitate selection. Selection 
decisions most often affect those at either end of the skill 
continuum. That is, more able students are selected for inclusion 
in advanced writing programs, while less able students 
are selected for remedial writing programs. 

Certification. Tests tailored to a specified certification do- 
main are often used to verify and document a student's mas- 
tery of specific knowledge or skills. For example, teachers 
might use writing tests to certify mastery of beginning writ- 
ing skills for purposes of grading or promotion. Or district 
and state administrators might use minimum writing compe- 
tency tests as criteria for high school graduation. Both exam- 
ples show how certification may be accomplished through 
testing. 

Using Tests to Evaluate Programs 

Survey Assessment. Survey assessment refers to the collection 
of group achievement data to determine general educational 
development (e.g., in writing). Data may be gathered 
by administering a writing test to a carefully selected random 
sample of students in the target population. Survey assess- 
ment is often cyclical, thus allowing for the examination of 
trends in writing skill development over time. Decision 
makers include (1) building-, district- or state-level adminis- 
trators who allocate resources for special instructional needs 
pinpointed by the assessment, or (2) the public, which makes 
value judgments regarding perceived and reported levels of 
student writing skill development. 

Formative Evaluation. In the context of formative pro- 
gram evaluation, program administrators and teachers at- 
tempt to determine which components of instruction are 
functioning as intended and which need further refinement. 
They may test students on each of the intermediate and final 
outcomes of a writing program, for example. Assessment for 
formative evaluation may also involve multiple test 
administrations to determine the effectiveness of ongoing 
modifications in a writing program. 

Summative Evaluation. Summative evaluation reveals a 
program's overall merit, suggesting whether that program 
should be continued or terminated. Tests designed to assess 
students' performance on final learning outcomes are an 
important part of such an evaluation. Teachers, program, 
building or district administrators, and the public (including 
the board of education) may be involved in summative 
evaluation decisions. As with survey assessment and formative 
evaluation, multiple test administrations are common. Tests 
may be given prior to as well as following instruction, with 
retention testing after a given time interval. 

Selecting Examinees as a Function of Purpose 

In the three program evaluation contexts just cited (survey 
assessment, formative evaluation, and summative evaluation), 
testing costs can be significantly reduced through random 
sampling. If the student population is very large, then 
data summarized across a carefully selected random subset 
of students will reflect group performance nearly as accurately 
as if every student were tested, often at a fraction of 
the cost. It is not within the scope of this paper to present all 
the important considerations in sampling, as each specific 
educational situation is unique. The intent is to point out the 
potential financial advantage of sampling and to urge its 
consideration. 
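
To give a sense of the trade-off, the standard error of a group mean under simple random sampling can be computed for several sample sizes. The sketch below is an illustration under stated assumptions, not a sampling plan: the population of 20,000 students and the score standard deviation of 0.9 are hypothetical, and real designs (stratified or cluster samples, for example) call for different formulas.

```python
from math import sqrt


def standard_error(score_sd, sample_size, population_size):
    """Standard error of a sample mean under simple random sampling,
    including the finite population correction."""
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    if population_size == 1:
        return 0.0
    fpc = sqrt((population_size - sample_size) / (population_size - 1))
    return score_sd / sqrt(sample_size) * fpc


# Hypothetical figures: 20,000 students; holistic scores with SD of 0.9.
for n in (100, 400, 1600, 20000):
    print(n, round(standard_error(0.9, n, 20000), 3))
```

The error shrinks as the sample grows and reaches zero only when every student is tested, so the question is how much precision the decision at hand actually requires.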

It should be apparent that sampling is not feasible with 
instructional management or student screening decisions be- 
cause in these contexts, individual student data are neces- 
sary. 

Developing Exercises as a Function of Purpose 

Generally, the process for developing writing assessment 
exercises remains constant across all eight educational as- 
sessment contexts. Careful planning is essential in all cases, 
and attention must always be given to designing exercises 
that give the examinee sufficient opportunity (in terms of 
time, appropriate stimulus and range of tasks) to demon- 
strate proficiency. Further, in all cases, the type of audience 
and purpose for communication should be made clear to the 
student. In addition, exercises should frame challenging 
tasks based on varied and directly relevant stimulus materials. 
And finally, in all cases, clear and concise instructions 
are essential. 

A few factors vary according to context and the nature of 
the decisions to be made. As a general rule, the specificity of 
an exercise (i.e., level of detail in instructions) should increase 
along with the specificity of the skills to be assessed. In other 
words, exercises to be used in broad survey assessment need 
not be quite so focused as exercises to be used in, say, a 
diagnostic test. 

The amount of writing required might also vary, depend- 
ing on the decisions to be made. For example, it might be 
possible to rank order students in terms of general writing 
proficiency (via holistic scoring) on the basis of three or four 
general, relatively short writing samples. However, it would 
probably be very difficult to use those same three or four 
short writing samples to reliably and validly determine 
whether a student had mastered 10 to 15 specific, independent 
writing skills. Generally, the more precise and numerous 
the criteria and standards of acceptable performance, 
the more writing needed to evaluate performance. 

And finally, exercises developed for use in a large-scale 
statewide assessment or where important selection decisions 
are pending must be (1) independently reviewed by writing 
and assessment experts and (2) pretested. Pretesting and review 
are less critical with writing assessment exercises used 
in instructional classroom management. 

Selecting Scoring Procedures as a Function of Purpose 

Selection of scoring procedures is, in effect, part of assess- 
ment planning, since this decision is influenced by the pur- 
pose for the assessment and criteria to be used in judging 
writing proficiency. Though it is possible to conceptualize 
instances within each of the eight educational assessment 
contexts in which any given scoring approach could be employed, 
the actual scoring approach most commonly used will 
vary by context. 

To illustrate, diagnosis of individual student strengths and 
weaknesses demands the level of specificity provided 
through analytical, primary trait or mechanics scoring. 
Placement and guidance, on the other hand, may only re- 
quire holistic ratings because the objective of assessment is 
simply to rank order students on a continuum of writing skill. 

Consider measurement of student status. While selection 
may require a holistic ranking of students, certification may 
be done through holistic ratings or analytical or primary trait 
scoring, depending on the specificity of the minimum compe- 
tencies to be certified. 

Holistic scoring procedures are well suited to the relatively 
broad, unfocused nature of large-scale survey assessment. 
However, analytical scoring may serve as well if the desire for 
individual data justifies the additional time required. 

Table 4 

Writing Assessment Procedures 
as a Function of Assessment Context 

Context        Decision to be made         Decision makers     Examinees   Exercise
                                                               assessed    specificity
-------------- --------------------------- ------------------- ----------- ------------------
Diagnosis      Determine and track         Teacher, Student    Individual  Specific
               educational development

Placement      Match level of student      Teacher, Counselor  Individual  General
               development to level of
               instruction

Guidance       Rank order for educational  Administrator,      Individual  General
               planning decisions          Counselor, Teacher,
                                           Parent, Student

Selection      Rank order examinees for    Administrator,      Individual  General
               selection into instruction  Counselor, Teacher

Certification  Determine mastery of        Teacher, Student    Individual  Specific
               specific competencies

Survey         Policy decision re status   Administrators,     Sample      General
Assessment     of student educational      Public
               development

Formative      Determine components of     Program Developer,  Sample      Depends on
Evaluation     instructional program in    Teacher                         program objectives
               need of revision

Summative      Program continuation        Administrator       Sample      Depends on
Evaluation                                                                 program objectives

Context               Scoring procedures marked
--------------------- -----------------------------------------------
Diagnosis             Analytical, Primary trait, Mechanics
Placement             Holistic [one further mark; column not recoverable]
Guidance              Holistic
Selection             Holistic
Survey Assessment     Holistic, Analytical
Formative Evaluation  [marks not legible]
Summative Evaluation  Holistic, Analytical
Certification         Holistic, Analytical, Primary trait
                      [one further mark; column not recoverable]

[The second part of the table marks each context under the columns 
Holistic, Analytical, Primary trait, Mechanics and T-unit. Column 
positions shown here follow the accompanying discussion.] 

Scoring procedures for formative evaluation depend on 
the specificity of the enabling and terminal objectives that 
guide instruction. If overall writing proficiency is the focus of 
the program, analytical scoring may be selected. However, if 
instruction focuses on situation-specific rhetorical skills, pri- 
mary trait scoring may be most appropriate. Similarly, em- 
phasis on mechanics indicates selection of a corresponding 
scoring approach. In most instances, formative evaluation 
demands scoring procedures more specific than holistic. 

With summative evaluation, holistic assessment may pro- 
vide sufficient data to judge program viability. However, if 
stated program goals subdivide writing skill into component 
parts, analytical scoring may be appropriate. Instructional 
programs in writing seldom focus on a single rhetorical 
circumstance. Rather, they deal with writing of many types, for 
many purposes. Therefore, primary trait scoring will have 
limited value in this context. 
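
The pairings of context and scoring procedure described in this section can be collected in a simple lookup. The sketch below records only what the discussion above states; the names are hypothetical, and the lists are starting points for planning rather than rules.

```python
# Scoring procedures the discussion above associates with each context.
SCORING_BY_CONTEXT = {
    "diagnosis": ["analytical", "primary trait", "mechanics"],
    "placement": ["holistic"],
    "guidance": ["holistic"],
    "selection": ["holistic"],
    "certification": ["holistic", "analytical", "primary trait"],
    "survey assessment": ["holistic", "analytical"],
    "formative evaluation": ["analytical", "primary trait", "mechanics"],
    "summative evaluation": ["holistic", "analytical"],
}


def candidate_procedures(context):
    """Return the scoring procedures discussed for an assessment context."""
    key = context.strip().lower()
    if key not in SCORING_BY_CONTEXT:
        raise KeyError(f"unknown assessment context: {context!r}")
    return SCORING_BY_CONTEXT[key]


print(candidate_procedures("Summative Evaluation"))
```

The final choice still depends on the specificity of the objectives and on the time and raters available.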

Ensuring Efficient, Effective, and High Quality 
Assessment 

The keys to successful direct writing assessment are careful 
planning, thoughtful and creative exercise development, and 
consistent application of performance criteria during scoring. 
If these factors are given meticulous attention, the 
assessment will yield data that are (1) sufficiently precise to 
support necessary decisions, (2) reliable, (3) valid for the 
intended purpose, and (4) maximally cost-effective. 

The preceding discussion is intended to acquaint the in- 
terested educator with available assessment strategies and to 
highlight some of the issues involved in selecting a scoring 
procedure appropriate for a specific context. Table 4 provides 
an overall summary of the key points made in that 
discussion. 

The reader is encouraged to refer to the list of REFERENCES 
following this section and to the APPENDIX, 
which names contact persons in many states who can 
offer further information on writing assessment approaches 
and contingencies. In addition, CAPT welcomes 
further inquiries regarding writing assessment. 






REFERENCES 



American College Testing Program. Alternative strategies for the 
assessment of writing proficiency. Iowa City, IA: Author, 1979. 

Braddock, R., Lloyd-Jones, R., and Schoer, L. Research in written 
composition. Urbana, IL: National Council of Teachers of English, 
1963. 

Breland, H.M., Conlon, G.C., and Rogosa, D. A preliminary study 
of the writing ability. New York, NY: College Entrance Examina- 
tion Board, 1966. 

Cronbach, L.J. Test validity. In R.L. Thorndike, Educational 
Measurement. Washington, D.C.: American Council on Education, 
1971. 

Diederich, P.B. Measuring growth in English. Urbana, IL: National 
Council of Teachers of English, 1974. 

Fredrick, V. Writing assessment research report: A national sur- 
vey. Monograph published by the Wisconsin Department of Pub- 
lic Instruction, Madison, WI: 1979. 

Godshalk, F.I., Swineford, F., and Coffman, W.E. The measurement 
of writing ability. New York, NY: College Entrance Examination 
Board, 1966. 

Hogan, T.P., and Mishler, C. Relationships between essay tests and 
objective tests of language skills for elementary school students. 
Journal of Educational Measurement, 1980, 17, 219-227. 

Hunt, K.W. Early blooming and late blooming syntactic structures. 
In C. Cooper and L. Odell (Eds.), Evaluating writing. Urbana, IL: 
National Council of Teachers of English, 1977. 

Huntley, R.M., Schmeiser, C., and Stiggins, R. The assessment of 
rhetorical proficiency: The role of objective tests and writing 
samples. Paper presented at the annual meeting of the National 
Council on Measurement in Education, 1979. 

Melton, V., and McCready, M. Survey of large-scale writing assess- 
ment programs. Ruston, LA: Louisiana Technological Universi- 
ty, 1981. 

Moss, Pamela A., Cole, Nancy S., and Khampalikit, Choosak. A 
comparison of procedures to assess written language skills in 
grades 10, 7 and 4. Paper presented at the Annual Meeting of 
the American Educational Research Association, Los Angeles, 
CA: 1981. 

Mullis, I. The primary trait system for scoring writing tasks. Den- 
ver, CO: National Assessment of Educational Progress, 1974. 

National Assessment of Educational Progress. Writing mechanics 
1969-1974: A capsule description of changes in writing me- 
chanics. Denver, CO: Author, 1975. 






Page, E.B. The analysis of essays by computer (Final Report, U.S. 
Office of Education Project 6-1318). Storrs, CT: University of 
Connecticut, 1968. 

Rivas, F. Write/rewrite: An assessment of revision skills (Writing 
Report No. 05-W-04). Denver, CO: National Assessment of Edu- 
cational Progress, 1977. 

Steele, J. The assessment of writing proficiency via qualitative rat- 
ings of writing samples. Paper presented at the Annual Meeting 
of the National Council on Measurement in Education, 1979. 

Stiggins, R.J. A comparison of direct and indirect writing assess- 
ment methods. Research in the Teaching of English (in press). 






APPENDIX: 

Large-Scale Writing Assessment 
Program Profiles 

(from Melton & McCready, 1981) 






TESTING METHOD

                 Grades        Sample Size    Objective  Writing  Exercises
State            Tested        (x 1000)       Test       Sample   Developed By

Alabama          3, 6, 9       40-60*                    X        State Department; University Faculty; Teachers
Delaware         1-8, 11       <5             X          X        State Department; NAEP Exercises
California       3, 6, 12      <5             X          X        Local Districts
Florida          3, 5, 8, 11   <5             X          X        State Department; Teachers
Hawaii           4, 8, 11      <5                        X        Committee
Idaho            9             10-20*                    X        State Department
Louisiana        3, 7, 10      <5             X          X        Teachers
Maine            4, 8, 11      10-20*                    X        NAEP Exercises
Maryland         9-12          >60*           X          X        State Department; Contractor; Teachers; Local Districts
Massachusetts    7, 8, 9, 12   >60*                      X        State Department; Teachers

*Entire Population Tested






WRITING SAMPLE DESCRIPTION

                 Kind of       Scoring         Results
State            Writing       Method          Used By              Contact

Alabama          Narration     Holistic        Local Districts      Dr. William Berryman
                                               Schools              State Dept. of Education
                                               Teachers             Room 607, State Office Bldg.
                                                                    Montgomery, AL 36130

Delaware         Narration     Primary Trait   Local Districts      Mr. Robert Bigelow
                 Persuasion                    Schools              State Dept. of Public Instr.
                                               Teachers             Townsend Bldg., Box 1402
                                               State Department     Dover, DE 19901
                                               Public Report

California       Varies by     Holistic        Local Districts      Dr. Dale Carlson
                 District                      Schools              State Dept. of Education
                                               Teachers             721 Capitol Mall
                                                                    Sacramento, CA 95814

Florida          Special Task  Analytical      State Summary        Dr. Thomas H. Fisher
                                               Disseminated         State Dept. of Education
                                               on request           Knott Building
                                                                    Tallahassee, FL 32301

Hawaii           Narration     Holistic        Local Districts      Dr. Selvin Chin-Chance
                 Exposition                                         State Dept. of Education
                 Description                                        Queen Liliuokalani Bldg.
                 Persuasion                                         1390 Miller Street
                                                                    P.O. Box 2360
                                                                    Honolulu, HI 96804

Idaho            Narration     Holistic        Local Districts      Ms. Karen Underwood
                 Exposition                    Schools              State Dept. of Education
                 Description                                        Len B. Jordan Office Bldg.
                 Persuasion                                         Boise, ID 83720

Louisiana        Narration     Primary Trait   Local Districts      Mr. Joseph Williams
                 Exposition                    Schools              Bureau of Assessment
                 Persuasion                    Teachers             State Dept. of Education
                                               State Department     P.O. Box 44064
                                                                    Baton Rouge, LA 70804

Maine            Narration     Holistic        Local Districts      Dr. Horace P. Maxcy, Jr.
                 Exposition                    NAEP                 State Dept. of Educational and
                 Description                                        Cultural Services
                                                                    State Office Building
                                                                    Augusta, ME 04333

Maryland         Narration     Holistic        State Department     Dr. William Grant
                 Exposition                    Local Districts      State Dept. of Education
                 Description                   Schools              BWI Airport
                                                                    P.O. Box 8717
                                                                    Baltimore, MD 21240

Massachusetts    Description   Holistic        Local Districts      Dr. Allan Hartman
                 Persuasion    Analytical      Schools              State Dept. of Education
                                                                    [?]1 St. James
                                                                    Boston, MA 02116






TESTING METHOD

                 Grades        Sample Size    Objective  Writing  Exercises
State            Tested        (x 1000)       Test       Sample   Developed By

Michigan**       4, 7, 10      <5                        X        State Department; University Faculty; NAEP Exercises
Minnesota        1             <5                        X        State Department; [illegible]; University Faculty; Local Districts
Nevada           3, 6, 9-12    5-10*          X          X        State Department; Teachers
New Hampshire    5, 9, 12      <5                        X        State Department
New Jersey       9             >60*           X          X        [illegible]
New Mexico       10            Unspecified               X        State Department; Teachers; Local Districts
North Carolina   11            <5                        X
Ohio             8, 12         <5             X          X        State Department; University Faculty; Teachers
Oregon           4, 7, 11      <5             X          X        State Department
Pennsylvania     5, 8, 11      Unspecified    X

*Entire Population Tested
**Assessment Under Development






WRITING SAMPLE DESCRIPTION

                 Kind of       Scoring         Results
State            Writing       Method          Used By              Contact

Michigan         Narration     Primary Trait   To be specified      Dr. Edward Roeber
                 Exposition                                         Michigan Dept. of Education
                 Description                                        620 Michigan National Tower
                                                                    P.O. Box 30008
                                                                    Lansing, MI 48909

Minnesota        Narration     Primary Trait   Statewide Reporting  Dr. William McMillian
                 Exposition                                         State Dept. of Education
                 Description                                        Capitol Square, 550 Cedar St.
                 Persuasion                                         St. Paul, MN 55101

Nevada           Exposition    Holistic        Local Districts      Dr. R. Harold Mathers
                 Description                   Schools              State Dept. of Education
                 Persuasion                    Teachers             400 West King Street
                                               Parents & Students   Carson City, NV 89701

New Hampshire    Narration     Holistic        State Department     Dr. James V. Carr
                 Exposition                    Local Districts      State Dept. of Education
                                                                    64 North Main Street
                                                                    Concord, NH 03301

New Jersey       Narration     Holistic        Local Districts      Dr. Stephen Koffler
                                               Schools              Department of Education
                                               Teachers             225 West State Street
                                               [illegible]          Room [illegible]
                                               Students             Trenton, NJ 08625

New Mexico       Description   Holistic        Local Districts      Dr. Carroll L. Hall
                 Persuasion                    Schools              State Dept. of Education
                                               Teachers             Education Building
                                                                    Santa Fe, NM 87503

North Carolina   Unspecified   Holistic        To be specified      Dr. William J. Brown
                               Analytical                           State Dept. of Public
                                                                    Instruction
                                                                    Raleigh, NC 27611

Ohio             Narration     Holistic        State Department     Mr. Jim Payton
                 Exposition                    Local Districts      State Dept. of Education
                 Description                   Schools              65 South Front Street
                 Persuasion                                         Room 804
                                                                    Columbus, OH 43215

Oregon           Narration     Holistic        State Department     R. B. Clemmer
                 Exposition                                         Oregon Dept. of Education
                 Description                                        700 Pringle Parkway SE
                 Persuasion                                         Salem, OR 97310

Pennsylvania                                   State Department     Dr. Robert Coldiron
                                                                    State Dept. of Education
                                                                    P.O. Box 911
                                                                    Harrisburg, PA 17126






TESTING METHOD

                 Grades        Sample Size    Objective  Writing  Exercises
State            Tested        (x 1000)       Test       Sample   Developed By

Rhode Island     4, 6, 8, 10   <5             X          X        Contractor
South Carolina   6, 8, 11      >60*                      X        State Department; University Faculty; Contractor; Teachers; Local Districts
Texas            3, 5, 9                      X          X        State Department; University Faculty; Contractor; Teachers
Wyoming          6, 9          <5                        X        State Department; University Faculty; Teachers

City

Little Rock, AR                Not specified  X
Phoenix, AZ      9-12          5-10           X          X        Teacher; Local Districts
Monterey, CA     1-12          <5                        X        Teacher
Tallahassee, FL  1-8           5-10           X          X        University Faculty; Teachers; Local Districts
Atlanta, GA      1-12                         X
Des Moines, IA   9             <5                        X        Teacher; Local Districts

*Entire Population Tested






WRITING SAMPLE DESCRIPTION

                 Kind of       Scoring         Results
State            Writing       Method          Used By              Contact

Rhode Island     Narration     Holistic        Local Districts      Ms. Martha Highsmith
                 Persuasion                    Schools              State Dept. of Education
                                               Teachers             199 Promenade Street
                                               State Department     Suite 204
                                                                    Providence, RI 02908

South Carolina   Narration     Holistic        Local Districts      Dr. Vana Meredith
                 Exposition    Analytical      Schools              State Dept. of Education
                 Description                   State Department     1429 Senate Street, Room 604
                 Persuasion                                         Columbia, SC 29201

Texas            Narration     Holistic        Local Districts      Mr. Keith L. Cruse
                 Description                                        Texas Education Agency
                 Persuasion                                         201 East 11th Street
                                                                    Austin, TX 78701

Wyoming          Narration     Holistic        Local Districts      Dr. Mark Fox
                 Description                   Schools              State Dept. of Education
                                                                    Hathaway Building
                                                                    Cheyenne, WY 82002

City

Little Rock, AR                                Local Districts      Dr. Carolyn Weddle
                                               Schools              Little Rock School District
                                               Teachers             West Markham & Izard
                                                                    Little Rock, AR 72201

Phoenix, AZ      Narration     Analytical      Schools              Mr. Gerald DeGrow
                 Exposition                    Teachers             Phoenix UHS District 210
                 Description                                        252[?] W. Osborn Rd.
                                                                    Phoenix, AZ 85017

Monterey, CA     Exposition    Holistic        Schools              Dr. Lloyd Swanson
                 Description                   Teachers             Monterey Peninsula Unified
                                                                    School District
                                                                    P.O. Box 131
                                                                    Monterey, CA 93940

Tallahassee, FL  Narration     Analytical      Local Districts      Mr. F. W. Ashmore
                 Exposition                    Schools              Leon Co. Public Schools
                 Description                   Teachers             P.O. Box 246
                                                                    Tallahassee, FL 32302

Atlanta, GA                                    Schools              Mr. Alonzo Crim
                                               Teachers             Ind. School District [number illegible]
                                                                    [?]24 Central Avenue S.W.
                                                                    Atlanta, GA 30303

Des Moines, IA   Persuasion    Holistic        Schools              Mr. Dwight M. Davis
                               Analytical      Teacher              Des Moines Ind. Comm. Dist.
                                                                    1800 Grand Avenue
                                                                    Des Moines, IA 50307






TESTING METHOD

                 Grades        Sample Size    Objective  Writing  Exercises
City             Tested        (x 1000)       Test       Sample   Developed By

Chicago, IL      9-12          >60                       X        Teacher
Boston, MA       [?], 5, 8     5-10                      X        State Department; Teachers
Wichita, KS      K-12          10-20                     X        Teachers; Coord. of L.A.
Baltimore, MD    1-9                          X          X        Teachers; Local Districts
Detroit, MI      10-12         10-20          X          X        Contractor; Dept. L.A.
Raleigh, NC      1-12          <5             X          X        Contractor; Teachers; Local Districts
Albuquerque, NM  1, 6, 9-12    5-10           X          X        State Department; Teachers
Santa Fe, NM     7-12          <5                        X        Teachers
New York, NY     8, 11         >60                       X        State Department
Portland, OR     3-9           Not specified  X






WRITING SAMPLE DESCRIPTION

                 Kind of       Scoring         Results
City             Writing       Method          Used By              Contact

Chicago, IL      Narration     Analytical      Local Districts      Mr. James Redmond
                 Exposition                                         Cook Co. Public Schools
                 Description                                        228 North La Salle Street
                 Persuasion                                         Chicago, IL 60601

Boston, MA       Narration     Holistic        Local Districts      Mr. William Leary
                 Exposition                    Schools              Boston Public School Dist.
                 Description                   Teachers             15 Beacon Street
                                                                    Boston, MA 02108

Wichita, KS      Narration     Holistic        Local Districts      Dr. Alvin E. Morris
                                               Schools              Wichita Sedgwick Unfd.
                                                                    Dist. 259
                                                                    428 S. Broadway
                                                                    Wichita, KS 67202

Baltimore, MD    Narration     Analytical      Schools              Mr. Roland Patterson
                 Persuasion                    Teachers             Baltimore Co. Public Schools
                                                                    3 E. 35th Street
                                                                    Baltimore, MD 21218

Detroit, MI      Exposition    Holistic                             Mr. Charles Wolfe
                                                                    Wayne Co. Public Schools
                                                                    5057 Woodward
                                                                    Detroit, MI 48202

Raleigh, NC      Narration     Teacher Option  Schools              Mr. C. L. Hooper
                 Exposition                    Teachers             Raleigh Dist. Public Schools
                 Description                   Parents              601 Devereux St.
                                               Students             Raleigh, NC 27605

Albuquerque, NM  Exposition    Holistic        Local Districts      Mr. E. Stapleton
                 Description                   Schools              Bernalillo Co. Public Schools
                 Persuasion                    Teachers             Box 1927
                                               State Department     Albuquerque, NM 87103
                                               Reported to Media
                                               Report to Student

Santa Fe, NM     Description   Holistic        Schools              Mr. Philip Bebo
                               Analytical                           Santa Fe Co. Public Schools
                                                                    610 Alta Vista
                                                                    Santa Fe, NM 87501

New York, NY     Exposition    Holistic        Local Districts      Mr. Calvin E. Gross
                 Persuasion                    Schools              New York City Schools
                                               Teachers             110 Livingston Street
                                               State Department     Brooklyn, NY 11201

Portland, OR                                   Local Districts      Dr. Walter Hathaway
                                               Schools              Portland Public Schools
                                               Teachers             P.O. Box 3107
                                               Parents              Portland, OR 97208
                                               Students








TESTING METHOD

                 Grades        Sample Size    Objective  Writing  Exercises
City             Tested        (x 1000)       Test       Sample   Developed By

Austin, TX       3, 9          <5             X          X        State Department; University Faculty; Contractor; Teachers
Madison, WI      5, 8, 11      <5             X          X        State Department; University Faculty; Parent/Bus. People
Seattle, WA      3, 6, 9, 11   <5                        X        Teachers; Curr. Specialists
Laramie, WY      6, 9          <5                        X        Com. of local and state univ. members






WRITING SAMPLE DESCRIPTION

                 Kind of       Scoring         Results
City             Writing       Method          Used By              Contact

Austin, TX       Narration     Holistic        Local Districts      Dr. Jack Davidson
                 Exposition                    Schools              Austin ISD
                 Persuasion                    Teachers             6100 N. Guadalupe
                                               State Department     Austin, TX 78752

Madison, WI      Narration     Holistic        Not specified        Mr. D. S. Ritchie
                 Exposition    Primary Trait                        Dane Co. Public Schools
                 Persuasion                                         W. Dayton
                                                                    Madison, WI 53703

Seattle, WA                    Analytical      Schools              Mr. Forbes Bottomly
                                                                    Seattle School Dist. 1
                                                                    815 Fourth Ave. N.
                                                                    Seattle, WA 98109

Laramie, WY      Exposition    Holistic        Local Districts      Dr. Joe Lutjeharms
                 Description                   Schools              Laramie Co. Public School
                                               State Department     District 1
                                                                    Cheyenne, WY 82001







